Learning Embeddings from Scientific Corpora using Lexical,
               Grammatical and Semantic Information
              Andres Garcia-Silva                                          Ronald Denaux                            Jose Manuel Gomez-Perez
            agarcia@expertsystem.com                                   rdenaux@expertsystem.com                       jmgomez@expertsystem.com
                  Expert System                                              Expert System                                 Expert System
                   Madrid, Spain                                             Madrid, Spain                                  Madrid, Spain

ABSTRACT                                                                              machine-readability among other benefits. In addition, publishers
Natural language processing can assist scientists to leverage the in-                 have started releasing knowledge graphs such as Springer Nature
creasing amount of information contained in scientific bibliography.                  SciGraph 2, an open linked data graph about publications from
The current trend, based on deep learning and embeddings, uses                        the editorial group and cooperating partners, and the Literature
representations at the (sub)word level that require large amounts of                  Graph in Semantic Scholar [1]. Nevertheless, the knowledge in
training data and neural architectures with millions of parameters                    scholarly communications is still mainly text, which is difficult
to learn successful language models, like BERT. However, these                        for software agents to process. Research objects shed some light on the
representations may not be well suited for the scientific domain,                     publication content through semantic annotations; however, they
where it is common to find complex terms, e.g. multi-word, with                       are user-generated and scarce in existing repositories [14]. Semantic
a domain-specific meaning in a very specific context. In this pa-                     Scholar, on the other hand, uses Natural Language Processing to
per we propose an approach based on a linguistic analysis of the                      extract keywords and identify topics relevant for the publications.
corpus using a knowledge graph to learn representations that can                          In fact NLP technology is progressing at a fast pace thanks to
unambiguously capture such terms and their meaning. We learn                          word embeddings [19] and pre-trained language models based on
embeddings from different linguistic annotations on the text and                      transformers [30], which have advanced the state of the
evaluate them through a classification task over the SciGraph taxon-                  art on different evaluation tasks [12, 24]. Most existing word
omy, showing that our representations outperform (sub)word-level                      embeddings and pre-trained language models use sequences of
approaches.                                                                           characters, word pieces, and words in a sentence as their main input.
                                                                                      However, in the scientific domain there are terms consisting of more
CCS CONCEPTS                                                                          than one word that have domain-specific semantics. For example,
                                                                                      the meaning of a term such as Molecularly imprinted polymer 3 can
• Computing methodologies → Natural language process-
                                                                                      hardly be identified from the single words, word pieces or other
ing; Neural networks; Semantic networks; Machine learning ap-
                                                                                      character-based representations, and hence the neural models used
proaches.
                                                                                      for NLP need to learn the relation between the single words, word
                                                                                      pieces or characters, requiring complex architectures with a high
KEYWORDS
                                                                                      number of parameters to optimize, and a huge amount of training
NLP, neural networks, convolutional neural networks, embeddings,                      data.
text classification                                                                       Scientific terminology is domain specific and scarce in a general
                                                                                      corpus and hence accumulating the necessary amount of evidence
1     INTRODUCTION                                                                    from documents to identify it as a single entity with a specific mean-
Nowadays scholarly communications are evolving, thanks to the                         ing is very unlikely if we analyse single word and subword rep-
effort of research communities, funding agencies and publishers,                      resentations. On the other hand, precisely in the scientific domain,
beyond the conventional delivery method based on documents                            a considerable amount of structured resources, including catalogs, taxonomies
to gain better visibility, reuse capabilities and to foster a broader                 and knowledge graphs with specific terminology and their corre-
data accessibility [8]. The list of enhancements is wide and includes                 sponding definitions, is available. Thus, the question arises: what is
the availability of supporting material such as code 1 and research                   the minimum information unit, or combination thereof, which
software [29], the use of Digital Object Identifiers to favor reusability             allows for efficient representations in vector form and at the same
and proper credit to authors, the emergence of specialized academic                   time can be linked to a semantically significant concept?
search engines such as Semantic Scholar, and the adoption of the                          In this paper we propose to generate embeddings using surface
FAIR principles [32] to make data findable, accessible, interoperable                 forms, lemmas and concepts that are able to represent complex
and reusable.                                                                         terms consisting of more than one word. The linguistic information
   Aligned with the FAIR principles, in particular with the goal of                   is the result of applying a linguistic analysis that relies on a knowl-
assisting humans and machines in managing data, research objects                      edge graph where linguistic knowledge is encoded. The linguistic
[2, 3, 13, 33] encapsulate and annotate semantically all the resources                analysis performs a grammatical, syntactical and semantic analysis
involved in a research endeavour enabling data interoperability and                   to recognize and disambiguate terms that can consist of more than
1 see the Data Citation initiative at https://doi.org/10.25490/a97f-egyk
                                                                                      2 SciGraph homepage: https://www.springernature.com/gp/researchers/scigraph
                                                                                      3 According to Wikipedia a Molecularly imprinted polymer is a polymer that has been
Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).                                    processed using the molecular imprinting technique
Sciknow 2019, November 19th, 2019, Los Angeles, California, USA                                                               Garcia-Silva, et al.


one word. We generate embeddings from a scholarly communica-             3     LEVERAGING LEXICAL, GRAMMATICAL
tions corpus for single and joined representations (surface forms,             AND SEMANTIC INFORMATION
lemmas, part-of-speech, and concepts). We experiment with these
                                                                         To learn embeddings from different linguistic annotations we use
embeddings in a text classification task where the goal is to classify
                                                                         Vecsigrafo [11], a method to learn embeddings for linguistic annota-
academic publications in a topic taxonomy.
                                                                         tions on a text corpus. Vecsigrafo extends the Swivel algorithm [28]
   Our results show that using linguistic annotation embeddings
                                                                         to jointly learn embeddings for surface forms, lemmas, grammar
helps to learn better classifiers when compared to those learned
                                                                         types, and concepts on a corpus enriched with linguistic annota-
only with word or subword embeddings. According to our exper-
                                                                         tions. Vecsigrafo embeddings outperformed the previous state of
imentation, the best approach is to use surface form and lemma
                                                                         the art in word and word-sense embeddings by co-training surface
embeddings jointly. When surface form and lemma embeddings are
                                                                         form, lemma and concept embeddings as opposed to training each
enriched with grammar information embeddings, like part-of-speech
                                                                         individually.
tag embeddings, the classifier with the greatest precision is learned.
                                                                            In contrast to simple tokens produced by space separation tok-
On the other hand, concept embedding results were mixed, proba-
                                                                         enization, linguistic annotations used in Vecsigrafo are based on
bly due to the general-purpose annotator used in the experiments
                                                                         terms that are related to one or more words. Surface forms are
with a limited coverage of the scientific domain vocabulary.
                                                                         terms as they appear in the text, and lemmas are the base form of
   This paper is structured as follows. Section 2 describes the
                                                                         these terms. Surface forms and lemmas can refer to concepts in a
related work and the paper contributions. Section 3 summarizes the
                                                                         knowledge graph. For example, table 1 shows the linguistic annota-
approach to learn the embeddings for linguistic annotations. Next,
                                                                         tions added to a text excerpt taken from a publication. Note how at
Section 4 presents the experimental work where we evaluate the
                                                                         the surface form level some tokens are grouped into terms like local
embeddings in a text classification task. Finally, Section 5 presents
                                                                         anesthetic and phrenic nerve, and at the lemma level some surface
the conclusions and future lines of work.
                                                                         forms such as concerns and relating are turned into their base form
                                                                         concern and relate. The grammar information indicates the role of
                                                                         the terms as nouns (N), verbs (V), noun and verb phrases (NP, VP),
                                                                         prepositions (P) and punctuation marks (PNT). In addition, some of
                                                                         the terms are related to concepts, such as local anesthetic, which
2    RELATED WORK                                                        is annotated with the concept en%23107824862 that is defined as An
                                                                         anesthetic that numbs a local area.
Recent work in distributional representation of words has moved
                                                                            Formally, Vecsigrafo generates, from a corpus an embedding
from static [6, 17, 19, 21, 28] to contextualized word embeddings
                                                                         space Φ = {(x, e) : x ∈ SF ∪ L ∪ G ∪ C, e ∈ R^n}, where SF, L, G, and
[12, 22], in an effort to generate them dynamically according to the
                                                                         C are sets of surface forms, lemmas, grammar types, and concepts.
context and deal with phenomena like polysemy and homonymy.
                                                                         One of the benefits of Vecsigrafo is that concept embeddings con-
A main problem with traditional word embeddings is that un-
                                                                         tribute to identifying the intended meaning of ambiguous terms
seen words or rare words are not represented in the distributional
                                                                         in the corpus since the term and concept embeddings are learned
space and hence considered as out-of-vocabulary (OOV) words. To
                                                                         jointly. To use Vecsigrafo embeddings in Φ we need to annotate the
overcome the OOV problem different embedding representations
                                                                         target corpus with the linguistic elements used to learn the embed-
have been proposed including character level used in ELMo [22],
                                                                         dings. Note that embeddings representing linguistic annotations
character-n-grams used in FastText [5], subwords used in GPT [23]
                                                                         for the same term can be merged to generate a single embedding
and word pieces [27] used in BERT [12].
                                                                         for the term, for example, by applying vector operations such as
   In parallel, researchers have proposed to jointly learn concept
                                                                         concatenation or averaging, or dimensional reduction techniques
and word embeddings as an alternative approach to cope with the
                                                                         like PCA or SVD.
ambiguity of the language. For example Camacho-Collados et al.
[9] rely on Wikipedia and Chen et al. [10] on WordNet to generate
                                                                         4     EXPERIMENTAL WORK
concept embeddings. Many approaches learn embeddings straight
from knowledge graphs [7, 20, 25, 26], and others use linguistic         In this section we describe the scholarly communication corpus
annotations on a text corpus [11, 18].                                   used to learn the linguistic embeddings, the NLP toolkit used to
   In the scientific domain, Wang et al. [31] highlighted the limita-    annotate the corpus, the neural network that uses the linguistic
tions of general-purpose word embeddings in NLP tasks. To                embeddings to classify the research publications, and report the
deal with such limitations, Beltagy et al. [4] use BERT [12] to learn    evaluation results of the classifiers.
embeddings from the scientific domain. In this work we adopt the
Vecsigrafo approach [11] to generate embeddings from a scientific        4.1    Embeddings for Scholarly Communications
corpus for surface forms, lemmas and concepts. The Vecsigrafo em-        SciGraph [15] is a linked open data platform for the scientific do-
beddings encode linguistic information, in contrast to approaches        main. It contains information from the complete research process:
like Beltagy et al. [4] that rely on word pieces.                        research projects, conferences, authors and publications, among
   The main contribution of this paper is a comprehensive exper-         others. The knowledge graph contains more than 1 billion facts
imentation in the scientific domain with Vecsigrafo embeddings           about objects of interest to the scholarly domain, distributed over
jointly learned from linguistic annotations, and their comparison with   some 85 million entities described using 50 classes and more than
word and subword embeddings.                                             250 properties. Most of the knowledge graph is available under CC


   Concept        en%2326973 en%2377696 -        en%23107824862 en%23100274160          -     en%23100737313 en%23101569578
   Grammar              N           V        P           NP                 N          PNT           NP                 N
    Lemma            concern      relate     to   local anesthetic       toxicity        ,     phrenic nerve        blockade
 Surface Form       concerns     relating    to   local anesthetic       toxicity        ,     phrenic nerve        blockade
     Token          concerns     relating    to local anesthetic         toxicity        ,    phrenic nerve         blockade
Table 1: Linguistic annotations and tokens generated for the text excerpt "concerns relating to local anesthetic toxicity, phrenic
nerve blockade" extracted from an actual publication.
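
As a minimal sketch, the annotation layers of Table 1 and the merging of embeddings for the same term by concatenation or averaging (Section 3) could be represented as follows. The AnnotatedTerm class, the helper functions and the toy 3-dimensional vectors are hypothetical illustrations and are not part of Vecsigrafo or Cogito; the embeddings learned in this work have 300 dimensions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AnnotatedTerm:
    """One term of Table 1 with its annotation layers (hypothetical structure)."""
    surface_form: str
    lemma: str
    grammar: str
    concept: Optional[str]  # None when the term carries no concept annotation


# First four terms of the excerpt shown in Table 1.
terms: List[AnnotatedTerm] = [
    AnnotatedTerm("concerns", "concern", "N", "en%2326973"),
    AnnotatedTerm("relating", "relate", "V", "en%2377696"),
    AnnotatedTerm("to", "to", "P", None),
    AnnotatedTerm("local anesthetic", "local anesthetic", "NP", "en%23107824862"),
]


def concatenate(vectors: List[List[float]]) -> List[float]:
    """Merge embeddings by concatenation: the dimensions add up."""
    return [value for vector in vectors for value in vector]


def average(vectors: List[List[float]]) -> List[float]:
    """Merge embeddings by element-wise averaging: the dimension is preserved."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings with made-up values, for illustration only.
sf_embeddings: Dict[str, List[float]] = {"concerns": [0.25, 0.5, 0.75]}
lemma_embeddings: Dict[str, List[float]] = {"concern": [0.75, 0.0, 0.25]}

first = terms[0]
sf_vector = sf_embeddings[first.surface_form]
lemma_vector = lemma_embeddings[first.lemma]

merged_concat = concatenate([sf_vector, lemma_vector])
merged_avg = average([sf_vector, lemma_vector])

print(merged_concat)
print(merged_avg)
```

Concatenation keeps the information of each annotation layer separate at the cost of a larger input, whereas averaging keeps the input size fixed.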


         Linguistic annotations      Total      Distinct      Embeddings
            Token                    707M      1,486,848       1,486,848
            Surface Form             805M      5,090,304         692,224
            Lemma                    508M      4,798,313         770,048
            grammar                  804M             25               8
            Concept                  425M        212,365         147,456
Table 2: Token and linguistic annotations, and embeddings
generated from text in the title and abstract of research articles
and book chapters published between 2001 to 2017 and
available in Scigraph. The number of distinct linguistic annotations
is different than the embeddings because we filter
out articles and auxiliary verbs and apply a minimum frequency
threshold.

         Embeddings Generation      Precision      Recall      F-measure
            Normal Distribution       0,7596       0,6775       0,7015△
            Optimized by CNN          0,8062       0,767        0,7806▽
Table 3: Evaluation results for classifiers using token-based
embeddings generated randomly and following the normal
distribution (baseline) and optimized by the convolutional
neural network (upper bound)

                                                                                      4.2         Evaluation Task
                                                                                      Publications in Scigraph have one or more field of research codes
                                                                                      that classify the documents in 22 categories such as Mathematical
                                                                                      Sciences, Engineering or Medical and Health Sciences. Thus, we can
                                                                                      formulate a multi-label classification task that aims at predicting
                                                                                      one or more of these 22 first level categories for each publication.
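
The multi-label formulation above can be sketched as follows: the field of research codes of each publication become a multi-hot target vector with one position per first-level category. This is only an illustration under stated assumptions: the list below contains just the three category names mentioned in the text (the real task uses all 22), the function names and the example publication are hypothetical, and the text does not specify how precision, recall and f-measure are aggregated over labels and folds, so the metrics here are computed for a single publication.

```python
from typing import List, Sequence, Tuple

# Three of the 22 first-level categories named in the text; the rest are omitted.
CATEGORIES: List[str] = [
    "Mathematical Sciences",
    "Engineering",
    "Medical and Health Sciences",
]


def multi_hot(labels: Sequence[str], categories: Sequence[str] = CATEGORIES) -> List[int]:
    """Encode the categories of one publication as a 0/1 vector."""
    label_set = set(labels)
    unknown = label_set - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in label_set else 0 for category in categories]


def precision_recall_f(true: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and their harmonic mean for one multi-hot prediction."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical publication annotated with two categories.
target = multi_hot(["Engineering", "Medical and Health Sciences"])
# Hypothetical classifier output: one correct label, one wrong, one missed.
predicted = [1, 1, 0]

print(target)
print(precision_recall_f(target, predicted))
```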
BY 4.0 License (i.e., attribution) with the exception of abstracts and                    Embeddings are the natural numerical representation of text
grant metadata, which are available under CC BY-NC 4.0 License                        for neural networks. Kim [16] showed that convolutional neural
(i.e., attribution and non-commercial). A core ontology expressed in                  networks (CNN) are well suited for text classification, and his results
OWL encodes the semantics of the data in the knowledge graph                          improved the state of the art on different text classification tasks
consisting of 47 classes and 253 properties. From SciGraph we ex-                     and benchmarks. CNNs are based on convolutional layers that slide
tract publications including articles and book chapters published                     filters (aka kernels) across the input data and return the dot products
from 2001 to 2017. We use the titles and abstracts of the publications                of the elements of the filter and each fragment of the input. These
to generate the corpus with roughly 3.2 million publications, 1.4                     convolutions allow the network to learn features from the data,
million distinct words, and 700 million tokens.                                       alleviating the manual selection required in traditional approaches.
    Next we use the Expert System NLP suite (Cogito) to parse the text                Stacking several convolutional layers allows feature composition,
and add linguistic annotations. Cogito disambiguator relies on its                    increasing the level of abstraction from the initial layers to the
own knowledge graph called Sensigrafo, that encodes the linguistic                    output.
knowledge in a way similar to WordNet, and applies a rule-based                           To learn the classifier we use an off-the-shelf CNN implemen-
approach to disambiguation. The Sensigrafo contains about 400K                        tation available in Keras, with 3 convolutional layers, 128 filters
lemmas and 300K concepts interlinked via 61 relation types. Note                      and a 5-element window size. As corpus we use 187795 articles
that we could have used any other NLP toolkit as long as it generates                 available in SciGraph published in 2011. To evaluate the classifiers
the linguistic annotations used in this work. The corpus parsing                      we use ten-fold cross-validation and precision, recall and f-measure
and annotations generated by Cogito are reported in table 2.                          as metrics. We use a vocabulary with a maximum of 20K entries and
    For each linguistic element we learned an initial set of embed-                   a sequence size of 1000.
dings with 300 dimensions using Vecsigrafo. The difference between                        As baseline, we train a classifier that learns from embeddings
the number of learned embeddings and the linguistic annotations                       generated randomly following a normal distribution. As upper
is due to a filter that we applied based on previous results [11]. We                 bound we learn a classifier that is able to optimize the embeddings
filter out elements with grammar type article, punctuation mark                       in the learning process. The evaluation results of the baseline and upper bound
or auxiliary verbs and generalize tokens with grammar type entity                     classifiers are presented in table 3.
or person proper noun, replacing the original token with special
tokens grammar#ENT and grammar#NPH respectively. In addition                          4.3         Classifiers using Vecsigrafo embeddings
to these embeddings, we learned 10 Vecsigrafo embedding spaces                        We train classifiers using single Vecsigrafo embeddings for each
for the possible combinations of size 2 and 3 between the linguistic                  linguistic annotation (sf, l, c) and for the ten combinations of size
elements sf, l, g and c.                                                              2 and 3 of (sf, l, g, c). Grammar embeddings were not evaluated


independently due to the low number of distinct grammar types
used to annotate the terms. When using embeddings of two or
three linguistic annotations, two different approaches are used. The
first approach relies on a single vocabulary containing at most 20K
entries for each linguistic annotation in the text, and no merging
operation is carried out, while in the second one embeddings are
merged using concatenation or averaging. Evaluation results are
reported in table 4.

   Linguistic Annotations    Merging    Precision    Recall    F-Measure↓
          sf_l                  -        0,8104      0,7638      0,7818
          sf_l_c                -        0,8135      0,7598      0,7809
          l_c                   -        0,8102      0,7604      0,7797
          l_g_c                 -        0,8099      0,7592      0,7791
          sf_g_c                -        0,8126      0,7585      0,7790
          sf_l                 Avg       0,8093      0,7588      0,7787
          l_g_c                Avg       0,8125      0,7558      0,7779
          sf_l_g                -        0,8144      0,7549      0,7779
          sf_l_c               Avg       0,8080      0,7581      0,7773
          sf_g_c               Avg       0,8137      0,7548      0,7769
          sf_l_g              Concat     0,8148      0,7543      0,7765
          l_c                  Avg       0,8040      0,7592      0,7763
          sf_c                  -        0,8096      0,7549      0,7754
          l_g                   -        0,8121      0,7498      0,7728
          l                     -        0,8035      0,7539      0,7728
          sf_c                 Avg       0,8023      0,7543      0,7722
          l_g                 Concat     0,8077      0,7472      0,7688
          sf                    -        0,8030      0,7477      0,7684
          t                     -        0,8008      0,7491      0,7679
          sf_g                  -        0,8124      0,7387      0,7653
          c                     -        0,7973      0,7453      0,7650
          sf_g                Concat     0,8101      0,7317      0,7648
          c_g                   -        0,8095      0,7357      0,7629
          c_g                 Concat     0,8076      0,7320      0,7596
Table 4: Classifiers learned using Vecsigrafo embeddings
and token embeddings (in grey row) sorted in descending order by
F-Measure. Only the best classifier for either average or
concatenation merging operation is reported. Italic and Bold
font indicate the top 5 results per metric. The top value per
metric is underlined

4.4     Lemmas better than surface forms and
        tokens
Regarding single linguistic annotations, lemma (l) and surface
form (sf) embeddings each contribute to learning a better classifier
than token (t) embeddings. This shows that
the classifier learning process benefits from the conflation of different
term and word variations (sf, t) into a base form (l). However,
grouping raw tokens into terms (sf) only generates a slight
improvement in the classifier performance with respect to using only
tokens (t). On the other hand, the performance of concept (c) embeddings
in this task is worse than that of t embeddings. The low number of c
embeddings (see table 2) compared to the number of tokens and the
other linguistic annotations negatively affects the learning process.
The difference between concepts and tokens is a consequence of the
limited coverage of the general-purpose annotator used in a highly
specialized domain such as the scientific one.

4.5     Lemmas and surface forms the best
        combination
To analyse the results of the different combinations of embeddings
for linguistic annotations we focus on each evaluation metric. Re-
garding precision the top 2 classifiers are learned from combina-
tions of sf, l and g. In addition note that the common linguistic
element in the top 6 classifiers is g combined either with sf or l,       needs a high precision and a high recall. The combination of
and in general removing g produces less precise classifiers. Thus,        surface form (sf) and lemma (l) embeddings is at the top of
precision-wise the part-of-speech information in combina-                 the f-measure ranking, followed by their combination with c. In
tion with surface forms and lemmas is very relevant. Seman-               general, concept embeddings improve the f-measure when com-
tic information (c) also contributes to enhance precision when it         bined with either lemmas or surface forms. However, when used in
is combined with lemmas and surface forms, or with lemmas and             conjunction with lemmas and surface form embeddings the perfor-
grammar information. In addition, the precision of 16 classifiers         mance is worse. In general, due to the low coverage of concepts in
out of 22 is better than the upper bound reported in table 3, where       the scientific domain, the classifiers that rely only on c embeddings
the embeddings are optimized in the classifier learning phase, even       perform worst even when combined with grammar information.
though vecsigrafo embeddings were not learned for this specific           Similarly surface forms offer poor performance when combined
purpose.                                                                  with grammar information.
    The recall analysis shows a different picture since the grammar          Finally note how the best classifiers were learned when the
information (g) does not seem to have a decisive role in the clas-        linguistic annotation embeddings are used independently, which
sifier performance. Surface forms and lemmas generate the                 contrasts with the worse results achieved when merging the embed-
classifier with the highest recall. Nevertheless, in this analysis con-   dings.
cepts (c) gain more relevance, always in combination with either sf
or l. The combination of l and c seems to benefit recall since it is      4.6    Words and subwords
present in 3 of the top 5 classifiers. In contrast, when concepts are     We also test embeddings generated from word constituents. We
combined with sf the recall is lower. In general g-based embedding        resorted to FastText[6] since Vecsigrafo approach was not designed
combinations generate classifiers with lower recall. Note that none       to generate embeddings for word constituents. We use FastText to
of the classifiers reached the recall of the upper bound classifier.      generate token and character-ngram embeddings, with n ranging
    The f-measure data shows more heterogeneous results since             from 3 to 6. We use these embeddings to learn the classifiers using
by definition it is the harmonic mean of precision and recall, and        the same CNN architecture and evaluation procedure used in the
hence the embedding combinations that generate the best f-measure         experiments described above. Evaluation results, presented in table
Learning Embeddings from Scientific Corpora using Lexical, Grammatical and Semantic Information          Sciknow 2019, November 19th, 2019, Los Angeles, California, USA


5, show that token embeddings are better than using token and character-ngram
embeddings, which is in line with our assumption that subword representations
may not be convenient in the scientific domain. Note that one of the benefits
of using character-ngram embeddings is to avoid the out-of-vocabulary (OOV)
word problem. However, in our case, the embeddings were learned from the whole
SciGraph corpus, so we do not face the OOV problem in our experiments.

FastText Embeddings        Precision    Recall    F-Measure
t                          0.8236       0.7493    0.7770
t + character-ngrams       0.8255       0.7429    0.7724
Table 5: Evaluation of classifiers learned from token and character-ngram
embeddings generated with FastText.

   On the other hand, results in tables 4 and 5 are not directly comparable
since the embeddings are generated with different algorithms (FastText vs
Vecsigrafo). For example, FastText token embeddings generate a better classifier
than Vecsigrafo token embeddings, and remarkably FastText embeddings in both
cases reach the highest precision of all the tested embeddings. Nevertheless,
we can see that the f-measure of the classifier that uses FastText
character-ngram embeddings is lower than the first 11 results reported in table
4, including the classifier that uses only lemmas.

5    CONCLUSIONS
Natural language processing has the potential to help scientists to manage and
get insights out of the huge amount of scholarly communications available.
Nowadays deep learning techniques based on word embeddings and language models
have advanced the state of the art in different NLP tasks. Nevertheless, the
predominant approach in NLP is to use word or subword representations as the
input of deep neural architectures that require large corpora to learn
well-performing language models. However, in contrast to general-purpose
corpora, the scientific vocabulary often contains complex terms comprising more
than one word, with the additional characteristic that these terms are very
specific and only make sense in certain fields of knowledge (e.g., Cosmic
Microwave Background Radiation). Thus models using word or subword
representations could have problems gathering the necessary textual evidence to
capture their meaning.
    To overcome the word and subword representation limitation we propose to
use embeddings based on linguistic annotations such as surface forms, lemmas,
part-of-speech information, and concepts. These embeddings are jointly learned
from a corpus of scientific communications using an existing approach called
Vecsigrafo. We evaluated the linguistic annotation embeddings in a multilabel
classification task where the goal was to assign a scientific topic to each
publication. Our evaluation results show that lemmas help to learn better
classifiers than space-separated words and subword representations based on
character-ngrams. The best results were achieved when lemmas and surface forms
were used jointly. Grammar information was very useful for high precision.
Concepts, on the other hand, were less helpful in general, mainly due to the low
coverage of concepts in the scientific domain. Since the part of the analysis
that identifies surface forms and lemmas is based on lexical and syntactic
analysis, their coverage was higher.
   As future work we want to evaluate the linguistic annotation embeddings on
evaluation tasks other than text classification where understanding the
glossary can have more impact, like entailment and question answering. In
addition, another line of research is to evaluate the impact of the linguistic
annotations when used as input representation to learn language models.

ACKNOWLEDGMENTS
This research has been supported by The European Language Grid project funded
by the European Union's Horizon 2020 research and innovation programme under
grant agreement No 825627 (ELG).

REFERENCES
 [1] Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles
     Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman,
     Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler C. Murray,
     Hsu-Han Ooi, Matthew E. Peters, Joanna L. Power, Sam Skjonsberg, Lucy Lu
     Wang, Christopher Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren
     Etzioni. 2018. Construction of the Literature Graph in Semantic Scholar.
     In NAACL-HLT.
 [2] S Bechhofer, I Buchan, D De Roure, P Missier, J Ainsworth, J Bhagat, P
     Couch, D Cruickshank, M Delderfield, I Dunlop, M Gamble, D Michaelides, S
     Owen, D Newman, S Sufi, and C Goble. 2013. Why linked data is not enough
     for scientists. Future Generation Computer Systems 29, 2 (2013), 599–611.
     https://doi.org/10.1016/j.future.2011.08.004 Special section: Recent
     advances in e-Science.
 [3] K Belhajjame, O Corcho, D Garijo, J Zhao, P Missier, DR Newman, R Palma, S
     Bechhofer, E Garcia-Cuesta, JM Gomez-Perez, G Klyne, K Page, M Roos, JE
     Ruiz, S Soiland-Reyes, L Verdes-Montenegro, D De Roure, and C Goble.
     [n. d.]. Workflow-Centric Research Objects: A First Class Citizen in the
     Scholarly Discourse. 1–12. http://ceur-ws.org/Vol-903/paper-01.pdf
 [4] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained
     Contextualized Embeddings for Scientific Text. arXiv:1903.10676
 [5] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016.
     Enriching Word Vectors with Subword Information. arXiv preprint
     arXiv:1607.04606 (2016).
 [6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017.
     Enriching word vectors with subword information. Transactions of the
     Association for Computational Linguistics 5 (2017), 135–146.
 [7] Antoine Bordes, Nicolas Usunier, Jason Weston, and Oksana Yakhnenko. 2013.
     Translating Embeddings for Modeling Multi-Relational Data. Advances in
     NIPS 26 (2013), 2787–2795. https://doi.org/10.1007/s13398-014-0173-7.2
     arXiv:1011.1669v3
 [8] Philip E. Bourne, Timothy W. Clark, Robert Dale, Anita de Waard, Ivan
     Herman, Eduard H. Hovy, and David Shotton. 2012. Improving The Future of
     Research Communications and e-Scholarship (Dagstuhl Perspectives Workshop
     11331). Dagstuhl Manifestos 1, 1 (2012), 41–60.
     https://doi.org/10.4230/DagMan.1.1.41
 [9] José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016.
     NASARI: Integrating explicit knowledge and corpus statistics for a
     multilingual representation of concepts and entities. Artificial
     Intelligence 240 (2016), 36–64.
     https://doi.org/10.1016/j.artint.2016.07.005
[10] Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A Unified Model for
     Word Sense Representation and Disambiguation. In EMNLP. 1025–1035.
[11] R Denaux and JM Gomez-Perez. 2019. Vecsigrafo: Corpus-based Word-Concept
     Embeddings-Bridging the Statistic-Symbolic Representational Gap in Natural
     Language Processing. To appear in Semantic Web Journal
     http://www.semantic-web-journal.net/system/files/swj2148.pdf (2019).
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
     BERT: Pre-training of Deep Bidirectional Transformers for Language
     Understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Andres Garcia-Silva, Jose Manuel Gomez-Perez, Raul Palma, Marcin Krystek,
     Simone Mantovani, Federica Foglini, Valentina Grande, Francesco De Leo,
     Stefano Salvi, Elisa Trasatti, Vito Romaniello, Mirko Albani, Cristiano
     Silvagni, Rosemarie Leone, Fulvio Marelli, Sergio Albani, Michele
     Lazzarini, Hazel J. Napier, Helen M. Glaves, Timothy Aldridge, Charles
     Meertens, Fran Boler, Henry W. Loescher, Christine Laney, Melissa A.
     Genazzio, Daniel Crawl, and Ilkay Altintas. 2019. Enabling FAIR research
     in Earth Science through research objects. Future Generation Computer
     Systems 98 (2019), 550–564. https://doi.org/10.1016/j.future.2019.03.046


[14] Jose Manuel Gomez-Perez, Raul Palma, and Andres Garcia-Silva. 2017.
     Towards a human-machine scientific partnership based on semantically rich
     research objects. In 2017 IEEE 13th International Conference on e-Science
     (e-Science). IEEE, 266–275.
[15] Tony Hammond, Michele Pasin, and Evangelos Theodoridis. 2017. Data
     integration and disintegration: Managing Springer Nature SciGraph with
     SHACL and OWL. In International Semantic Web Conference (Posters, Demos
     and Industry Tracks) (CEUR Workshop Proceedings), Nadeschda Nikitina,
     Dezhao Song, Achille Fokoue, and Peter Haase (Eds.), Vol. 1963.
     CEUR-WS.org.
     http://dblp.uni-trier.de/db/conf/semweb/iswc2017p.html#HammondPT17
[16] Yoon Kim. 2014. Convolutional Neural Networks for Sentence
     Classification. In EMNLP.
[17] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding As Implicit
     Matrix Factorization. In Proceedings of the 27th International Conference
     on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press,
     Cambridge, MA, USA, 2177–2185.
     http://dl.acm.org/citation.cfm?id=2969033.2969070
[18] Massimiliano Mancini, José Camacho-Collados, Ignacio Iacobacci, and
     Roberto Navigli. 2017. Embedding Words and Senses Together via Joint
     Knowledge-Enhanced Training. In CoNLL.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
     Estimation of Word Representations in Vector Space. CoRR abs/1301.3781
     (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
[20] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016.
     Holographic Embeddings of Knowledge Graphs. In AAAI.
[21] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014.
     Glove: Global vectors for word representation. In EMNLP, Vol. 14.
     1532–1543.
[22] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
     Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word
     Representations. In Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long Papers). Association for
     Computational Linguistics, 2227–2237.
     https://doi.org/10.18653/v1/N18-1202
[23] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
     Improving language understanding by generative pre-training. URL
     https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf
     (2018).
[24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
     Sutskever. 2019. Language models are unsupervised multitask learners.
     OpenAI Blog 1, 8 (2019).
[25] Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF graph embeddings
     for data mining. In International Semantic Web Conference, Vol. 9981
     LNCS. 498–514. https://doi.org/10.1007/978-3-319-46523-4_30
[26] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg,
     Ivan Titov, and Max Welling. 2018. Modeling Relational Data with Graph
     Convolutional Networks. arXiv:1703.06103
[27] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice
     search. In 2012 IEEE International Conference on Acoustics, Speech and
     Signal Processing (ICASSP). IEEE, 5149–5152.
[28] Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016.
     Swivel: Improving Embeddings by Noticing What's Missing. arXiv preprint
     (2016). arXiv:1602.02215
[29] Arfon M. Smith, Daniel S. Katz, and Kyle E. Niemeyer. 2016. Software
     citation principles. PeerJ Computer Science 2 (Sept. 2016), e86.
     https://doi.org/10.7717/peerj-cs.86
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
     Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is
     All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762
     http://arxiv.org/abs/1706.03762
[31] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei
     Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A comparison
     of word embeddings for the biomedical natural language processing.
     Journal of Biomedical Informatics 87 (2018), 12–20.
     https://doi.org/10.1016/j.jbi.2018.09.008
[32] Mark Wilkinson et al. 2016. The FAIR Guiding Principles for scientific
     data management and stewardship. Nature Scientific Data 160018 (2016).
     http://www.nature.com/articles/sdata201618
[33] J Zhao, JM Gomez-Perez, K Belhajjame, G Klyne, E García-Cuesta, A
     Garrido, KM Hettne, M Roos, D De Roure, and C Goble. 2012. Why workflows
     break - Understanding and combating decay in Taverna workflows. In 8th
     IEEE International Conference on E-Science. 1–9.
     https://doi.org/10.1109/eScience.2012.6404482
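
Note on the subword units of Section 4.6. To make the character-ngram setting
(n ranging from 3 to 6) concrete, the following minimal Python sketch lists the
n-grams of a single token. It assumes the boundary-marker convention described
for FastText in [6], where a token is wrapped in "<" and ">" before its n-grams
are extracted. The function name char_ngrams and its parameters are
hypothetical and chosen only for illustration; the sketch does not reproduce
FastText's n-gram hashing or training, and it is not the code used in the
experiments.

```python
from typing import List


def char_ngrams(token: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return the character n-grams of a token for min_n <= n <= max_n.

    Assumption: the token is wrapped in "<" and ">" boundary markers, as
    described for FastText in [6]. Hashing into buckets is omitted.
    """
    wrapped = f"<{token}>"
    grams: List[str] = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped token.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams
```

For instance, char_ngrams("where", 3, 3) yields the five trigrams "<wh", "whe",
"her", "ere" and "re>". A multi-word scientific term is split into tokens
first, so each n-gram sees at most one word of the term, which is the limitation
that motivates the lemma and concept embeddings evaluated in this paper.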