Learning Embeddings from Scientific Corpora using Lexical, Grammatical and Semantic Information

Andres Garcia-Silva (agarcia@expertsystem.com), Expert System, Madrid, Spain
Ronald Denaux (rdenaux@expertsystem.com), Expert System, Madrid, Spain
Jose Manuel Gomez-Perez (jmgomez@expertsystem.com), Expert System, Madrid, Spain

Sciknow 2019, November 19th, 2019, Los Angeles, California, USA
Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Natural language processing can help scientists leverage the increasing amount of information contained in scientific bibliography. The current trend, based on deep learning and embeddings, uses representations at the (sub)word level that require large amounts of training data and neural architectures with millions of parameters to learn successful language models, like BERT. However, these representations may not be well suited for the scientific domain, where it is common to find complex terms (e.g., multi-word terms) with a domain-specific meaning in a very specific context. In this paper we propose an approach based on a linguistic analysis of the corpus using a knowledge graph to learn representations that can unambiguously capture such terms and their meaning. We learn embeddings from different linguistic annotations on the text and evaluate them through a classification task over the SciGraph taxonomy, showing that our representations outperform (sub)word-level approaches.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Neural networks; Semantic networks; Machine learning approaches.

KEYWORDS
NLP, neural networks, convolutional neural networks, embeddings, text classification

1 INTRODUCTION
Scholarly communications are evolving, thanks to the effort of research communities, funding agencies and publishers, beyond the conventional delivery method based on documents, in order to gain better visibility and reuse capabilities and to foster broader data accessibility [8]. The list of enhancements is long and includes the availability of supporting material such as code¹ and research software [29], the use of Digital Object Identifiers to favor reusability and proper credit to authors, the emergence of specialized academic search engines such as Semantic Scholar, and the adoption of the FAIR principles [32] to make data findable, accessible, interoperable and reusable.

Aligned with the FAIR principles, in particular with the goal of assisting humans and machines in managing data, research objects [2, 3, 13, 33] encapsulate and semantically annotate all the resources involved in a research endeavour, enabling data interoperability and machine-readability among other benefits. In addition, publishers have started releasing knowledge graphs such as Springer Nature SciGraph², an open linked data graph about publications from the editorial group and cooperating partners, and the Literature Graph in Semantic Scholar [1]. Nevertheless, the knowledge in scholarly communications is still mainly text, which is difficult for software agents to process. Research objects shed some light on publication content through semantic annotations; however, these are user-generated and scarce in existing repositories [14]. Semantic Scholar, on the other hand, uses natural language processing (NLP) to extract keywords and identify topics relevant for the publications.

NLP technology is progressing at a fast pace thanks to word embeddings [19] and pre-trained language models based on transformers [30], which have improved the state of the art on different evaluation tasks [12, 24]. Most existing word embeddings and pre-trained language models use sequences of characters, word pieces, and words in a sentence as their main input. However, in the scientific domain there are terms consisting of more than one word that have domain-specific semantics. For example, the meaning of a term such as Molecularly imprinted polymer³ can hardly be identified from the single words, word pieces or other character-based representations. Hence the neural models used for NLP need to learn the relation between the single words, word pieces or characters, which requires complex architectures with a high number of parameters to optimize and a huge amount of training data.

Scientific terminology is domain specific and scarce in a general corpus. Accumulating the necessary amount of evidence from documents to identify a term as a single entity with a specific meaning is therefore very unlikely if we analyse only single-word and subword representations. On the other hand, precisely in the scientific domain, structured resources are available, including catalogs, taxonomies and knowledge graphs with specific terminology and the corresponding definitions. Thus the question arises: what is the minimum information unit, or combination of units, that allows for efficient representations in vector form and at the same time can be linked to a semantically significant concept?

In this paper we propose to generate embeddings using surface forms, lemmas and concepts that are able to represent complex terms consisting of more than one word. The linguistic information is the result of applying a linguistic analysis that relies on a knowledge graph where linguistic knowledge is encoded. The linguistic analysis performs a grammatical, syntactical and semantic analysis to recognize and disambiguate terms that can consist of more than one word. We generate embeddings from a scholarly communications corpus for single and joined representations (surface forms, lemmas, part-of-speech, and concepts). We experiment with these embeddings in a text classification task where the goal is to classify academic publications in a topic taxonomy.

Our results show that using linguistic annotation embeddings helps to learn better classifiers than those learned only with word or subword embeddings. According to our experiments, the best approach is to use surface form and lemma embeddings jointly. When surface form and lemma embeddings are enriched with grammar information embeddings, such as part-of-speech tag embeddings, the classifier with the greatest precision is learned. On the other hand, results for concept embeddings were mixed, probably because the general-purpose annotator used in the experiments has limited coverage of the scientific domain vocabulary.

This paper is structured as follows. Section 2 describes the related work and the paper contributions. Section 3 summarizes the approach to learn the embeddings for linguistic annotations. Next, Section 4 presents the experimental work where we evaluate the embeddings in a text classification task. Finally, Section 5 presents the conclusions and future lines of work.

¹ See the Data Citation initiative at https://doi.org/10.25490/a97f-egyk
² SciGraph homepage: https://www.springernature.com/gp/researchers/scigraph
³ According to Wikipedia, a Molecularly imprinted polymer is a polymer that has been processed using the molecular imprinting technique.

2 RELATED WORK
Recent work in distributional representation of words has moved from static [6, 17, 19, 21, 28] to contextualized word embeddings [12, 22], in an effort to generate them dynamically according to the context and deal with phenomena like polysemy and homonymy. A main problem with traditional word embeddings is that unseen or rare words are not represented in the distributional space and are hence considered out-of-vocabulary (OOV) words. To overcome the OOV problem, different embedding representations have been proposed, including the character level used in ELMo [22], character n-grams used in FastText [5], subwords used in GPT [23] and word pieces [27] used in BERT [12].

In parallel, researchers have proposed to jointly learn concept and word embeddings as an alternative approach to cope with the ambiguity of language. For example, Camacho-Collados et al. [9] rely on Wikipedia and Chen et al. [10] on WordNet to generate concept embeddings. Many approaches learn embeddings straight from knowledge graphs [7, 20, 25, 26], and others use linguistic annotations on a text corpus [11, 18].

In the scientific domain, Wang et al. [31] highlighted the limitations of general-purpose word embeddings in NLP tasks. To deal with such limitations, Beltagy et al. [4] use BERT [12] to learn embeddings from the scientific domain. In this work we adopt the Vecsigrafo approach [11] to generate embeddings from a scientific corpus for surface forms, lemmas and concepts. Vecsigrafo embeddings encode linguistic information, in contrast to approaches like Beltagy et al. [4] that rely on word pieces.

The main contribution of this paper is a comprehensive experimentation in the scientific domain with Vecsigrafo embeddings jointly learned from linguistic annotations, and their comparison with word and subword embeddings.

3 LEVERAGING LEXICAL, GRAMMATICAL AND SEMANTIC INFORMATION
To learn embeddings from different linguistic annotations we use Vecsigrafo [11], a method to learn embeddings for linguistic annotations on a text corpus. Vecsigrafo extends the Swivel algorithm [28] to jointly learn embeddings for surface forms, lemmas, grammar types, and concepts on a corpus enriched with linguistic annotations. Vecsigrafo embeddings outperformed the previous state of the art in word and word-sense embeddings by co-training surface form, lemma and concept embeddings as opposed to training each individually.

In contrast to simple tokens produced by space-separation tokenization, linguistic annotations used in Vecsigrafo are based on terms that are related to one or more words. Surface forms are terms as they appear in the text, and lemmas are the base form of these terms. Surface forms and lemmas can refer to concepts in a knowledge graph. For example, Table 1 shows the linguistic annotations added to a text excerpt taken from a publication. Note how at the surface form level some tokens are grouped into terms like local anesthetic and phrenic nerve, and at the lemma level some surface forms such as concerns and relating are turned into their base forms concern and relate. The grammar information indicates the role of the terms as nouns (N), verbs (V), noun and verb phrases (NP, VP), prepositions (P) and punctuation marks (PNT). In addition, some of the terms are related to concepts, like local anesthetic, which is annotated with the concept en%23107824862, defined as "An anesthetic that numbs a local area".

Tokens               Surface form       Lemma              Grammar  Concept
concerns             concerns           concern            N        en%2326973
relating             relating           relate             V        en%2377696
to                   to                 to                 P        -
local | anesthetic   local anesthetic   local anesthetic   NP       en%23107824862
toxicity             toxicity           toxicity           N        en%23100274160
,                    ,                  ,                  PNT      -
phrenic | nerve      phrenic nerve      phrenic nerve      NP       en%23100737313
blockade             blockade           blockade           N        en%23101569578

Table 1: Linguistic annotations and tokens generated for the text excerpt "concerns relating to local anesthetic toxicity, phrenic nerve blockade" extracted from an actual publication.

Formally, Vecsigrafo generates from a corpus an embedding space $\Phi = \{(x, e) : x \in SF \cup L \cup G \cup C,\ e \in \mathbb{R}^n\}$, where $SF$, $L$, $G$, and $C$ are the sets of surface forms, lemmas, grammar types, and concepts. One of the benefits of Vecsigrafo is that concept embeddings contribute to identifying the intended meaning of ambiguous terms in the corpus, since the term and concept embeddings are learned jointly. To use the Vecsigrafo embeddings in $\Phi$ we need to annotate the target corpus with the linguistic elements used to learn the embeddings. Note that embeddings representing linguistic annotations for the same term can be merged to generate a single embedding for the term, for example by applying vector operations such as concatenation or averaging, or dimensionality reduction techniques like PCA or SVD.

4 EXPERIMENTAL WORK
In this section we describe the scholarly communication corpus used to learn the linguistic embeddings, the NLP toolkit used to annotate the corpus, and the neural network that uses the linguistic embeddings to classify the research publications, and we report the evaluation results of the classifiers.

4.1 Embeddings for Scholarly Communications
SciGraph [15] is a linked open data platform for the scientific domain. It contains information from the complete research process: research projects, conferences, authors and publications, among others. The knowledge graph contains more than 1 billion facts about objects of interest to the scholarly domain, distributed over some 85 million entities described using 50 classes and more than 250 properties. Most of the knowledge graph is available under the CC BY 4.0 License (i.e., attribution), with the exception of abstracts and grant metadata, which are available under the CC BY-NC 4.0 License (i.e., attribution and non-commercial). A core ontology expressed in OWL, consisting of 47 classes and 253 properties, encodes the semantics of the data in the knowledge graph. From SciGraph we extract publications, including articles and book chapters, published from 2001 to 2017. We use the titles and abstracts of the publications to generate a corpus with roughly 3.2 million publications, 1.4 million distinct words, and 700 million tokens.

Next we use the Expert System NLP suite (Cogito) to parse the text and add linguistic annotations. The Cogito disambiguator relies on its own knowledge graph, called Sensigrafo, which encodes linguistic knowledge in a way similar to WordNet, and applies a rule-based approach to disambiguation. Sensigrafo contains about 400K lemmas and 300K concepts interlinked via 61 relation types. Note that we could have used any other NLP toolkit as long as it generates the linguistic annotations used in this work. The results of the corpus parsing and the annotations generated by Cogito are reported in Table 2.

Linguistic annotations   Total   Distinct    Embeddings
Token                    707M    1,486,848   1,486,848
Surface Form             805M    5,090,304   692,224
Lemma                    508M    4,798,313   770,048
Grammar                  804M    25          8
Concept                  425M    212,365     147,456

Table 2: Tokens and linguistic annotations, and embeddings generated from text in the title and abstract of research articles and book chapters published between 2001 and 2017 and available in SciGraph. The number of distinct linguistic annotations differs from the number of embeddings because we filter out articles and auxiliary verbs and apply a minimum frequency threshold.

For each linguistic element we learned an initial set of embeddings with 300 dimensions using Vecsigrafo. The difference between the number of learned embeddings and the number of linguistic annotations is due to a filter that we applied based on previous results [11]. We filter out elements with grammar type article, punctuation mark or auxiliary verb, and we generalize tokens with grammar type entity or person proper noun, replacing the original token with the special tokens grammar#ENT and grammar#NPH respectively. In addition to these embeddings, we learned 10 Vecsigrafo embedding spaces for the possible combinations of size 2 and 3 of the linguistic elements sf, l, g and c.

4.2 Evaluation Task
Publications in SciGraph have one or more field of research codes that classify the documents in 22 categories such as Mathematical Sciences, Engineering or Medical and Health Sciences. Thus, we can formulate a multi-label classification task that aims at predicting one or more of these 22 first-level categories for each publication.

Embeddings are the natural numerical representation of text for neural networks. Kim [16] showed that convolutional neural networks (CNNs) are well suited for text classification, and his results improved the state of the art on different text classification tasks and benchmarks. CNNs are based on convolutional layers that slide filters (also known as kernels) across the input data and return the dot products of the elements of the filter and each fragment of the input. These convolutions allow the network to learn features from the data, alleviating the manual selection required in traditional approaches. Stacking several convolutional layers allows feature composition, increasing the level of abstraction from the initial layers to the output.

To learn the classifier we use an off-the-shelf CNN implementation available in Keras, with 3 convolutional layers, 128 filters and a 5-element window size. As corpus we use 187,795 articles available in SciGraph published in 2011. To evaluate the classifiers we use ten-fold cross-validation, with precision, recall and f-measure as metrics. We use a vocabulary with a maximum of 20K entries and a sequence size of 1000.

As baseline, we train a classifier that learns from embeddings generated randomly following a normal distribution. As upper bound, we learn a classifier that is able to optimize the embeddings in the learning process. The evaluation results of the baseline and upper bound classifiers are presented in Table 3.

Embeddings generation   Precision   Recall   F-measure
Normal Distribution     0.7596      0.6775   0.7015△
Optimized by CNN        0.8062      0.767    0.7806▽

Table 3: Evaluation results for classifiers using token-based embeddings generated randomly following the normal distribution (baseline) and optimized by the convolutional neural network (upper bound).

4.3 Classifiers using Vecsigrafo embeddings
We train classifiers using single Vecsigrafo embeddings for each linguistic annotation (sf, l, c) and for the ten combinations of size 2 and 3 of (sf, l, g, c). Grammar embeddings were not evaluated independently due to the low number of distinct grammar types used to annotate the terms. When using embeddings of two or three linguistic annotations, two different approaches are used. The first approach relies on a single vocabulary containing at most 20K entries for each linguistic annotation in the text, and no merging operation is carried out, while in the second one embeddings are merged using concatenation or average. Evaluation results are reported in Table 4.

Linguistic annotations   Merging   Precision   Recall   F-Measure↓
sf_l                     -         0.8104      0.7638   0.7818
sf_l_c                   -         0.8135      0.7598   0.7809
l_c                      -         0.8102      0.7604   0.7797
l_g_c                    -         0.8099      0.7592   0.7791
sf_g_c                   -         0.8126      0.7585   0.7790
sf_l                     Avg       0.8093      0.7588   0.7787
l_g_c                    Avg       0.8125      0.7558   0.7779
sf_l_g                   -         0.8144      0.7549   0.7779
sf_l_c                   Avg       0.8080      0.7581   0.7773
sf_g_c                   Avg       0.8137      0.7548   0.7769
sf_l_g                   Concat    0.8148      0.7543   0.7765
l_c                      Avg       0.8040      0.7592   0.7763
sf_c                     -         0.8096      0.7549   0.7754
l_g                      -         0.8121      0.7498   0.7728
l                        -         0.8035      0.7539   0.7728
sf_c                     Avg       0.8023      0.7543   0.7722
l_g                      Concat    0.8077      0.7472   0.7688
sf                       -         0.8030      0.7477   0.7684
t                        -         0.8008      0.7491   0.7679
sf_g                     -         0.8124      0.7387   0.7653
c                        -         0.7973      0.7453   0.7650
sf_g                     Concat    0.8101      0.7317   0.7648
c_g                      -         0.8095      0.7357   0.7629
c_g                      Concat    0.8076      0.7320   0.7596

Table 4: Classifiers learned using Vecsigrafo embeddings and token embeddings (in grey row), sorted in descending order of F-Measure. Only the best classifier for either average or concatenation merging operation is reported. Italic and bold font indicate the top 5 results per metric. The top value per metric is underlined.

4.4 Lemmas better than surface forms and tokens
Regarding single linguistic annotations, lemma (l) and surface form (sf) embeddings each help to learn a better classifier than token (t) embeddings. This shows that the classifier learning process benefits from the conflation of different term and word variations (sf, t) into a base form (l). However, grouping raw tokens into terms (sf) only generates a slight improvement in classifier performance with respect to using only tokens (t). On the other hand, the performance of concept (c) embeddings in this task is worse than that of t embeddings. The low number of c embeddings (see Table 2) compared to the number of tokens and the other linguistic annotations negatively affects the learning process. The difference between concepts and tokens is a consequence of the limited coverage of the general-purpose annotator when used in a domain as highly specialized as the scientific one.

4.5 Lemmas and surface forms the best combination
To analyse the results of the different combinations of embeddings for linguistic annotations we focus on each evaluation metric. Regarding precision, the top 2 classifiers are learned from combinations of sf, l and g. In addition, note that the common linguistic element in the top 6 classifiers is g combined with either sf or l, and in general removing g produced less precise classifiers. Thus, precision-wise the part-of-speech information in combination with surface forms and lemmas is very relevant. Semantic information (c) also contributes to enhancing precision when it is combined with lemmas and surface forms, or with lemmas and grammar information. In addition, the precision of 16 classifiers out of 22 is better than the upper bound reported in Table 3, where the embeddings are optimized in the classifier learning phase, even though Vecsigrafo embeddings were not learned for this specific purpose.

The recall analysis shows a different picture, since the grammar information (g) does not seem to have a decisive role in classifier performance. Surface forms and lemmas generate the classifier with the highest recall. Nevertheless, in this analysis concepts (c) gain more relevance, always in combination with either sf or l. The combination of l and c seems to benefit recall since it is present in 3 of the top 5 classifiers. In contrast, when concepts are combined with sf the recall is lower. In general, g-based embedding combinations generate classifiers with lower recall. Note that none of the classifiers reached the recall of the upper bound classifier.

The f-measure data shows more heterogeneous results since, by definition, it is the harmonic mean of precision and recall, and hence the embedding combinations that generate the best f-measure need both high precision and high recall. The combination of surface form (sf) and lemma (l) embeddings is at the top of the f-measure ranking, followed by their combination with c. In general, concept embeddings improve the f-measure when combined with either lemmas or surface forms. However, when used in conjunction with both lemma and surface form embeddings the performance is worse. In general, due to the low coverage of concepts in the scientific domain, the classifiers that rely only on c embeddings perform worst, even when combined with grammar information. Similarly, surface forms offer poor performance when combined with grammar information.

Finally, note how the best classifiers were learned when the linguistic annotation embeddings are used independently, in contrast to the worse results achieved when merging the embeddings.

4.6 Words and subwords
We also test embeddings generated from word constituents. We resorted to FastText [6] since the Vecsigrafo approach was not designed to generate embeddings for word constituents. We use FastText to generate token and character n-gram embeddings, with n ranging from 3 to 6. We use these embeddings to learn the classifiers using the same CNN architecture and evaluation procedure used in the experiments described above.
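To make the subword representation concrete, the following is a minimal sketch of how character n-grams with n from 3 to 6 can be enumerated for a single token. It follows the convention described for FastText of wrapping the word in the boundary markers `<` and `>`; the function name `char_ngrams` and its parameters are hypothetical, and the sketch leaves out the whole-word sequence and the n-gram hashing that the actual FastText implementation also uses.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Enumerate character n-grams of `word` (hypothetical helper).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


# Trigrams only, for the word "where".
print(char_ngrams("where", 3, 3))
# n-grams for one word of a multi-word scientific term.
print(len(char_ngrams("polymer")))
```

A multi-word term such as Molecularly imprinted polymer is never seen as a unit in this representation: each token is decomposed separately, which is the limitation that surface form and lemma embeddings are meant to address.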
Evaluation results, presented in Table 5, show that token embeddings are better than using token and character n-gram embeddings together, which is in line with our assumption that subword representations may not be convenient in the scientific domain. Note that one of the benefits of using character n-gram embeddings is to avoid out-of-vocabulary (OOV) words. However, in our case the embeddings were learned from the whole SciGraph corpus, so we do not face the OOV problem in our experiments.

FastText embeddings     Precision   Recall   F-Measure
t                       0.8236      0.7493   0.7770
t + character-ngrams    0.8255      0.7429   0.7724

Table 5: Evaluation of classifiers learned from token and character n-gram embeddings generated with FastText.

On the other hand, the results in Tables 4 and 5 are not directly comparable since the embeddings are generated with different algorithms (FastText vs Vecsigrafo). For example, FastText token embeddings generate a better classifier than Vecsigrafo token embeddings, and remarkably FastText embeddings in both cases reach the highest precision of all the tested embeddings. Nevertheless, we can see that the f-measure of the classifier that uses FastText character n-gram embeddings is lower than the first 11 results reported in Table 4, including the classifier that uses only lemmas.

5 CONCLUSIONS
Natural language processing has the potential to help scientists manage and get insights out of the huge amount of scholarly communications available. Deep learning techniques based on word embeddings and language models have advanced the state of the art in different NLP tasks. Nevertheless, the predominant approach in NLP is to use word or subword representations as the input of deep neural architectures that require large corpora to learn well-performing language models. However, in contrast to general-purpose corpora, the scientific vocabulary often contains complex terms comprising more than one word, with the additional characteristic that these terms are very specific and only make sense in certain fields of knowledge (e.g., Cosmic Microwave Background Radiation). Thus models using word or subword representations could have problems gathering the necessary textual evidence to capture their meaning.

To overcome this limitation of word and subword representations, we propose to use embeddings based on linguistic annotations such as surface forms, lemmas, part-of-speech information, and concepts. These embeddings are jointly learned from a corpus of scientific communications using an existing approach called Vecsigrafo. We evaluate the linguistic annotation embeddings in a multi-label classification task where the goal is to assign scientific topics to each publication. Our evaluation results show that lemmas help to learn better classifiers than space-separated words and subword representations based on character n-grams. The best results were achieved when lemmas and surface forms were used jointly. Grammar information was very useful for high precision. Concepts, on the other hand, were less helpful in general, mainly due to the low coverage of concepts in the scientific domain. Since part of the analysis that identifies surface forms and lemmas is based on lexical and syntactical analysis, their coverage was higher.

As future work we want to evaluate the linguistic annotation embeddings on evaluation tasks other than text classification, where understanding the glossary can have more impact, such as entailment and question answering. In addition, another line of research is to evaluate the impact of the linguistic annotations when used as input representation to learn language models.

ACKNOWLEDGMENTS
This research has been supported by the European Language Grid project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825627 (ELG).

REFERENCES
[1] Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler C. Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna L. Power, Sam Skjonsberg, Lucy Lu Wang, Christopher Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the Literature Graph in Semantic Scholar. In NAACL-HLT.
[2] S Bechhofer, I Buchan, D De Roure, P Missier, J Ainsworth, J Bhagat, P Couch, D Cruickshank, M Delderfield, I Dunlop, M Gamble, D Michaelides, S Owen, D Newman, S Sufi, and C Goble. 2013. Why linked data is not enough for scientists. Future Generation Computer Systems 29, 2 (2013), 599–611. https://doi.org/10.1016/j.future.2011.08.004 Special section: Recent advances in e-Science.
[3] K Belhajjame, O Corcho, D Garijo, J Zhao, P Missier, DR Newman, R Palma, S Bechhofer, E Garcia-Cuesta, JM Gomez-Perez, G Klyne, K Page, M Roos, JE Ruiz, S Soiland-Reyes, L Verdes-Montenegro, D De Roure, and C Goble. [n. d.]. Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. 1–12. http://ceur-ws.org/Vol-903/paper-01.pdf
[4] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained Contextualized Embeddings for Scientific Text. arXiv:1903.10676
[5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016).
[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[7] Antoine Bordes, Nicolas Usunier, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-Relational Data. Advances in NIPS 26 (2013), 2787–2795. https://doi.org/10.1007/s13398-014-0173-7.2 arXiv:1011.1669v3
[8] Philip E. Bourne, Timothy W. Clark, Robert Dale, Anita de Waard, Ivan Herman, Eduard H. Hovy, and David Shotton. 2012. Improving The Future of Research Communications and e-Scholarship (Dagstuhl Perspectives Workshop 11331). Dagstuhl Manifestos 1, 1 (2012), 41–60. https://doi.org/10.4230/DagMan.1.1.41
[9] José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240 (2016), 36–64. https://doi.org/10.1016/j.artint.2016.07.005
[10] Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A Unified Model for Word Sense Representation and Disambiguation. In EMNLP. 1025–1035.
[11] R Denaux and JM Gomez-Perez. 2019. Vecsigrafo: Corpus-based Word-Concept Embeddings-Bridging the Statistic-Symbolic Representational Gap in Natural Language Processing. To appear in Semantic Web Journal http://www.semantic-web-journal.net/system/files/swj2148.pdf (2019).
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Andres Garcia-Silva, Jose Manuel Gomez-Perez, Raul Palma, Marcin Krystek, Simone Mantovani, Federica Foglini, Valentina Grande, Francesco De Leo, Stefano Salvi, Elisa Trasatti, Vito Romaniello, Mirko Albani, Cristiano Silvagni, Rosemarie Leone, Fulvio Marelli, Sergio Albani, Michele Lazzarini, Hazel J. Napier, Helen M. Glaves, Timothy Aldridge, Charles Meertens, Fran Boler, Henry W. Loescher, Christine Laney, Melissa A. Genazzio, Daniel Crawl, and Ilkay Altintas. 2019. Enabling FAIR research in Earth Science through research objects. Future Generation Computer Systems 98 (2019), 550–564. https://doi.org/10.1016/j.future.2019.03.046
[14] Jose Manuel Gomez-Perez, Raul Palma, and Andres Garcia-Silva. 2017. Towards a human-machine scientific partnership based on semantically rich research objects. In 2017 IEEE 13th International Conference on e-Science (e-Science). IEEE, 266–275.
[15] Tony Hammond, Michele Pasin, and Evangelos Theodoridis. 2017. Data integration and disintegration: Managing Springer Nature SciGraph with SHACL and OWL. In International Semantic Web Conference (Posters, Demos and Industry Tracks) (CEUR Workshop Proceedings), Nadeschda Nikitina, Dezhao Song, Achille Fokoue, and Peter Haase (Eds.), Vol. 1963. CEUR-WS.org. http://dblp.uni-trier.de/db/conf/semweb/iswc2017p.html#HammondPT17
[16] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP.
[17] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding As Implicit Matrix Factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2177–2185. http://dl.acm.org/citation.cfm?id=2969033.2969070
[18] Massimiliano Mancini, José Camacho-Collados, Ignacio Iacobacci, and Roberto Navigli. 2017. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training. In CoNLL.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[20] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic Embeddings of Knowledge Graphs. In AAAI.
[21] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP, Vol. 14. 1532–1543.
[22] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2227–2237. https://doi.org/10.18653/v1/N18-1202
[23] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
[24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
[25] Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, Vol. 9981 LNCS. 498–514. https://doi.org/10.1007/978-3-319-46523-4_30
[26] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. arXiv:1703.06103
[27] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5149–5152.
[28] Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving Embeddings by Noticing What's Missing. arXiv preprint (2016). arXiv:1602.02215
[29] Arfon M. Smith, Daniel S. Katz, and Kyle E. Niemeyer. 2016. Software citation principles. PeerJ Computer Science 2 (Sept. 2016), e86. https://doi.org/10.7717/peerj-cs.86
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[31] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics 87 (2018), 12–20. https://doi.org/10.1016/j.jbi.2018.09.008
[32] Mark Wilkinson et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Nature Scientific Data 160018 (2016). http://www.nature.com/articles/sdata201618
[33] J Zhao, JM Gomez-Perez, K Belhajjame, G Klyne, E García-Cuesta, A Garrido, KM Hettne, M Roos, D De Roure, and C Goble. 2012. Why workflows break - Understanding and combating decay in Taverna workflows. In 8th IEEE International Conference on E-Science. 1–9. https://doi.org/10.1109/eScience.2012.6404482