Actes du 1er atelier Valorisation et Analyse des Données de la Recherche (VADOR)

Classification of Keyphrases from Scientific Publications using WordNet and Word Embeddings

Davide Buscaldi, Simon David Hernandez, Thierry Charnois
Laboratoire d'Informatique de Paris Nord, CNRS (UMR 7030)
Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
{davide.buscaldi,hernandez-perez,thierry.charnois}@lipn.univ-paris13.fr

ABSTRACT. The ScienceIE task at SemEval-2017 introduced an epistemological classification of keyphrases in scientific publications, suggesting that research activities revolve around the key concepts of process (methods and systems), material (data and physical resources) and task. In this paper we present a method for the classification of keyphrases according to the ScienceIE classification, using features derived from WordNet and word embeddings. The method outperforms the best system at SemEval-2017, although our experiments highlighted some issues with the collection.

RÉSUMÉ. Dans le contexte du challenge ScienceIE à SemEval-2017, ses organisateurs ont introduit une classification des phrases clés dans les publications scientifiques. Selon leur hypothèse, les activités de recherche tournent autour des concepts clés de "process" (méthodes, systèmes), "material" (ressources matérielles, données, produits) et "task" (problèmes, activités à poursuivre). Dans cet article, nous présentons une méthode pour la classification des phrases clés selon la classification donnée par ScienceIE, en utilisant des caractéristiques dérivées de WordNet et de "word embeddings". La méthode proposée dépasse le meilleur système à SemEval-2017 ; toutefois, nos expériences ont mis en évidence certains problèmes d'annotation dans la collection.

KEYWORDS: Information Extraction, Text Mining on Scientific Literature, Keyphrase Extraction.
MOTS-CLÉS : Extraction de mots clés, extraction d'information, fouille de textes scientifiques.
1. Introduction

Nowadays, the number of scientific publications is continuously growing in all disciplines. According to (Bjork et al., 2009), 1.35 million articles were published in indexed journals in the single year 2006, and (Larsen, Von Ins, 2010) estimated the growth rate in the number of scientific publications to be between 2.2% and 9% for journals and between 1.6% and 14% for conferences (depending on the discipline) over the decade 1997-2007. It is becoming more and more difficult to search for the information required to write scientific papers, review the work of other researchers, or look for experts. Usually this kind of search involves checking the originality of an idea or a method. Current search engines dedicated to the exploration of scientific literature, such as Google Scholar1 and Scopus2, are based on text search and on author and citation graphs. Recent work from the semantic web, scientometrics and natural language processing communities has aimed to improve access to scientific literature (Osborne, Motta, 2015; Wolfram, 2016), and some initiatives have been started to gather researchers around this problem, like the SAVE-SD3 workshops and the ScienceIE task (Augenstein et al., 2017) at SemEval 20174.

In particular, the ScienceIE task focused on extracting keyphrases and relations between them, relying on the hypothesis that the ability to correctly recognise these semantic items in text will help in tasks related to the process of scientific publishing, such as recommending articles to readers, highlighting missing citations to authors, identifying potential reviewers for submissions, and analysing research trends over time.
The organizers' hypothesis is that some concepts, notably PROCESS, TASK and MATERIAL, are cardinal in scientific works, since they make it possible to answer questions like: "which papers addressed a Task using variants of some Process?". In their vision, Processes correspond to methods and equipment, and Materials to corpora and physical items. An example of text labelled with these concepts is shown in Figure 1 (Augenstein et al., 2017).

Figure 1. Example of annotation of a scientific document with ScienceIE concepts and relations.

In this paper, we propose a method to classify candidate terms into the three categories defined in the ScienceIE challenge, using surface features combined with WordNet-based features and word embeddings. This method outperforms the best result obtained at ScienceIE. In the remainder of this paper, we describe the method and the features used in Section 2, then we show the obtained results in Section 3, and finally we draw some conclusions in Section 4.

1. http://scholar.google.com
2. http://www.scopus.com
3. http://cs.unibo.it/save-sd/2017/index.html
4. https://scienceie.github.io

2. Proposed Method

The method we propose in this paper is based on Support Vector Machines (SVM), in particular the nu-SVM implementation by (Chang, Lin, 2011). SVMs are well-known maximum-margin classifiers; we chose them for their robustness on problems with a large number of features. Please note that the method described in this paper shares only part of its WordNet-based features with the one we used to participate in the task (Hernandez et al., 2017).

2.1. Base Features

The base features consist of all the {3,4,5}-character prefixes and suffixes of keyphrases that appeared in the training set with a frequency greater than 10.
For instance, from the keyphrase "information extraction" we can identify the following features: inf, info, infor as prefixes and ction, tion, ion as suffixes. Together with the prefixes and suffixes, we considered the following features:
– capitalization of the keyphrase (binary);
– uppercase ratio, calculated as the number of uppercase characters divided by the number of characters in the keyphrase;
– number of digits in the keyphrase;
– number of dashes;
– number of words.

2.2. WordNet-based Features

WordNet (Miller, 1995) is a well-known lexical database for the English language. In WordNet, word senses are represented as synsets, or "sets of synonyms", which may be connected to other synsets by some relationship. Two of the most common relationships are meronymy (part-of) and hyperonymy (is-a). We define a synpath as the list of synsets connecting a sense of a target word to the root of the hierarchy in WordNet, following the hyperonymy relation. In Figure 2 we show the synpaths corresponding to the three senses of the word extraction in WordNet 3.0. The definitions of the senses are as follows:
1. extraction#1: the process of obtaining something from a mixture or compound by chemical, physical or mechanical means;
2. extraction#2: properties attributable to your ancestry;
3. extraction#3: the action of taking out something (especially using effort or force).

Figure 2. Example of synpaths for the word "extraction" in WordNet 3.0 (simplified by removing some synsets).

From Figure 2 it can be observed that the synset process appears in the synpath (process, physical_entity) of extraction#1, which seems an important clue for classifying this keyword as a PROCESS according to the ScienceIE classification. We therefore assumed that synpaths can be effectively used as features to predict the category of a keyword.
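As a toy sketch of how a synpath can be followed and turned into binary features, consider the fragment below. The hand-written hypernym table mirrors the (process, physical_entity) path shown in Figure 2 for extraction#1, but it is only an illustrative assumption; a real implementation would query WordNet itself, e.g. through NLTK's WordNet interface.

```python
# Toy fragment of the WordNet noun hierarchy (illustrative, hand-copied
# from the simplified Figure 2; a real system would query WordNet).
HYPERNYM = {
    "extraction.n.01": "process.n.06",
    "process.n.06": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",  # entity is the root
}

# Distinctive synsets selected per class (in the style of Table 1).
DISTINCTIVE = {"process.n.06", "physical_entity.n.01"}

def synpath(synset):
    """Follow hyperonymy links from a synset up to the root of the hierarchy."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def synset_features(synset):
    """Binary features: which distinctive synsets lie on the synpath."""
    on_path = set(synpath(synset))
    return {s: s in on_path for s in sorted(DISTINCTIVE)}

print(synpath("extraction.n.01"))
# ['extraction.n.01', 'process.n.06', 'physical_entity.n.01', 'entity.n.01']
```

For a multi-word keyphrase, the same lookup would be applied to its rightmost noun, as described above.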
Given the number of synsets in WordNet (more than 117,000), we opted to select only a subset of them, limiting the scope to the synsets that are particularly distinctive for each of the three classes. We calculated, on the ScienceIE training corpus, the probability p(s|C) of each synset s with respect to class C. Subsequently, for each class Ci we ranked the synsets in decreasing order of the difference p(s|Ci) − (p(s|Cj) + p(s|Ck))/2, where Cj and Ck denote the other two classes. We show in Table 1 the most distinctive synsets for each category. The semantic correlation between the MATERIAL category and its distinctive synsets is particularly evident.

Table 1. Top 5 distinctive synsets for each category.

PROCESS                      MATERIAL               TASK
psychological_feature.n.01   physical_entity.n.01   science.n.01
event.n.01                   object.n.01            possession.n.02
abstraction.n.06             whole.n.02             natural_science.n.01
act.n.02                     artifact.n.01          question.n.02
cognition.n.01               matter.n.03            subject.n.01

We arbitrarily selected the top 20 distinctive synsets for each category and used them to extract binary features5. Such a feature is true for a token if the corresponding synset is present in any of the hypernym paths connecting the token's noun synsets to the root synset. Note that these features were added only for nouns, since there is no hierarchy for the other lexical categories (excluding verbs, whose hierarchy is in any case very shallow compared to that of nouns). If the keyphrase is composed of several terms, we search the synpaths of the rightmost noun in the keyphrase.

2.3. Word Embeddings Features

Word embeddings, as introduced by (Bengio et al., 2006), are vector representations of words, generated with neural networks, that capture a certain number of syntactic and semantic relationships.
In this work, we used pre-trained vectors trained on 100 billion words of a Google News dataset (Mikolov et al., 2013). The vocabulary size is 3 million words and the vector length is 300. One of the problems we had to solve to include embeddings was dealing with keyphrases composed of more than one term: vectors are linked to single words (or, in some cases, to compound words or terms). (De Boom et al., 2016) showed that it is possible to exploit the properties of embeddings to represent sentences with the element-wise average, max, or min of the vectors of the composing words. We chose to use the max.

3. Experiments and Results

We carried out our experiments on the ScienceIE dataset6, consisting of 450 articles collected from ScienceDirect, distributed among the domains of Computer Science, Material Sciences and Physics. The training set consists of 350 documents, while the test set consists of 100 documents. The organizers also distributed 50 documents as a development set, but we did not use these data. The task consisted of three sub-tasks:
– A) mention-level keyphrase identification;
– B) mention-level keyphrase classification;
– C) mention-level semantic relation extraction between keyphrases of the same keyphrase type; relation types are HYPONYM-OF and SYNONYM-OF.
We consider in this paper only sub-task B); for the evaluation, we refer to the scenario in which the text is manually annotated and keyphrase boundaries are given (Augenstein et al., 2017).

5. Full list: https://github.com/snovd/corpus-data/blob/master/SemEval2017Task10/SynsetsRelatedToTrainingData.txt
6. https://scienceie.github.io/resources.html
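The keyphrase-composition step from Section 2.3 (element-wise max over word vectors) can be sketched as follows. The 4-dimensional "embeddings" are invented for illustration (the actual model uses the 300-dimensional word2vec Google News vectors), and the zero-vector fallback for out-of-vocabulary keyphrases is our assumption, not something stated in the paper.

```python
import numpy as np

# Tiny made-up embedding table for illustration only; real vectors would
# come from the pre-trained Google News word2vec model (300 dimensions).
EMB = {
    "information": np.array([0.1, -0.4, 0.3, 0.0]),
    "extraction":  np.array([0.2,  0.1, -0.5, 0.4]),
}

def keyphrase_vector(keyphrase, emb, dim=4):
    """Compose a keyphrase vector as the element-wise max of the vectors
    of its in-vocabulary words (the choice made in the paper)."""
    vectors = [emb[w] for w in keyphrase.lower().split() if w in emb]
    if not vectors:
        return np.zeros(dim)  # out-of-vocabulary fallback (our assumption)
    return np.max(np.stack(vectors), axis=0)

print(keyphrase_vector("information extraction", EMB))
# [0.2 0.1 0.3 0.4]
```

The element-wise average or min variants discussed by (De Boom et al., 2016) would simply swap `np.max` for `np.mean` or `np.min`.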
In Table 2 we show the results obtained with different combinations of features, compared to the best system at the SemEval 2017 ScienceIE challenge (sub-task B, in the evaluation scenario where the keyphrase boundaries are given).

Table 2. F-measure obtained for each test configuration, compared with the best system at SemEval 2017 ScienceIE.

Configuration        PROCESS  MATERIAL  TASK  all
Base                 .577     .726      .322  .619
Base + WN            .728     .750      .325  .700
All features         .710     .778      .381  .716
Base + Embeddings    .701     .764      .407  .701
best@SemEval2017     .660     .760      .280  .670

From these results and the confusion matrices in Figure 3 it can be seen that the WordNet features are very helpful in discriminating the MATERIAL class from the PROCESS class, while the word embedding features had a positive impact on the TASK class, which was the most difficult one.

Figure 3. Confusion matrices for the four configurations tested: a) base features; b) base features + WordNet; c) base features + WordNet + embeddings; d) base features + embeddings.

The confusion matrices also show that TASK is often confused with PROCESS, which in turn seems to be too predominant, indicating a bias in the collection towards this class. An analysis of the annotated collection showed certain inconsistencies in the annotations that may be at the origin of these errors: for instance, in file 2212667814000732.ann, we found a conflicting annotation: "synthetic assessment method" alone is annotated as PROCESS, but the keyphrase "synthetic assessment method based on cloud theory" is annotated as TASK, which seems odd. In file S2212671612002351.ann, we found that "position estimation method" is labelled as TASK, when it should instead be a PROCESS.

4. Conclusions

We developed a method to classify keyphrases into a predefined set of categories provided by the ScienceIE task at SemEval-2017.
This method integrates external knowledge, acquired either from an existing resource like WordNet or learned from a large corpus of text and encoded as word embeddings, as features for an SVM classifier. The obtained results outperform those of the best system presented at SemEval-2017. Our method leaves room for improvement, since some parameters were chosen arbitrarily and further investigation is needed to find the optimal ones. We plan to exploit the domain of the document as an additional feature, supposing that keyphrase styles may vary depending on the domain. The experiments also highlighted some problems with the ScienceIE collection: on the one hand, one of the classes seems underrepresented; on the other, our analysis exposed a number of annotation errors which may require a manual re-annotation.

Acknowledgements

This work has been partly supported by the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex EFL).

References

Augenstein I., Das M. K., Riedel S., Vikraman L. N., McCallum A. (2017, August). SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications. In Proceedings of the International Workshop on Semantic Evaluation. Vancouver, Canada, Association for Computational Linguistics.

Bengio Y., Schwenk H., Senécal J.-S., Morin F., Gauvain J.-L. (2006). Neural probabilistic language models. In D. E. Holmes, L. C. Jain (Eds.), Innovations in Machine Learning: Theory and Applications, pp. 137–186. Berlin, Heidelberg, Springer. http://dx.doi.org/10.1007/3-540-33486-6_6

Bjork B.-C., Roos A., Lauri M. (2009). Scientific journal publishing: yearly volume and open access availability. Information Research: An International Electronic Journal, Vol. 14, No. 1.

Chang C.-C., Lin C.-J. (2011).
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, Vol. 2, pp. 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

De Boom C., Van Canneyt S., Demeester T., Dhoedt B. (2016, September). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, Vol. 80, pp. 150–156. https://doi.org/10.1016/j.patrec.2016.06.012

Hernandez S. D., Buscaldi D., Charnois T. (2017, August). LIPN at SemEval-2017 Task 10: Filtering Candidate Keyphrases from Scientific Publications with Part-of-Speech Tag Sequences to Train a Sequence Labeling Model. In Proceedings of the International Workshop on Semantic Evaluation. Vancouver, Canada, Association for Computational Linguistics.

Larsen P. O., Von Ins M. (2010). The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics, Vol. 84, No. 3, pp. 575–603.

Mikolov T., Sutskever I., Chen K., Corrado G. S., Dean J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.

Miller G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, Vol. 38, No. 11, pp. 39–41.

Osborne F., Motta E. (2015). Klink-2: Integrating multiple web sources to generate semantic topic networks. In Proceedings of the 14th International Conference on the Semantic Web - ISWC 2015 - Volume 9366, pp. 408–424. New York, NY, USA, Springer. http://dx.doi.org/10.1007/978-3-319-25007-6_24

Wolfram D. (2016). Bibliometrics, information retrieval and natural language processing: Natural synergies to support digital library research. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), pp. 6–13.
http://ceur-ws.org/Vol-1610/paper1.pdf