<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification of Keyphrases from Scientific Publications using WordNet and Word Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Buscaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon David Hernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Charnois</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire d'Informatique de Paris Nord, CNRS (UMR 7030) Université Paris 13, Sorbonne Paris Cité</institution>
          ,
          <addr-line>Villetaneuse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>50</fpage>
      <lpage>57</lpage>
      <abstract>
        <p>The ScienceIE task at SemEval-2017 introduced an epistemological classification of keyphrases in scientific publications, suggesting that research activities revolve around the key concepts of process (methods and systems), material (data, physical resources and products) and task (problems and activities to pursue). In this paper we present a method for classifying keyphrases according to the ScienceIE classification, using features derived from WordNet and word embeddings. The method outperforms the best system at SemEval-2017, although our experiments highlighted some annotation issues with the collection.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Extraction</kwd>
        <kwd>Text Mining on Scientific Literature</kwd>
        <kwd>Keyphrase extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The number of scientific publications is growing continuously in all
disciplines. According to
        <xref ref-type="bibr" rid="ref3">(Bjork et al., 2009)</xref>
        , 1.35 million articles were published
in indexed journals in 2006 alone, and the growth rate in the number of
scientific publications has been estimated by
        <xref ref-type="bibr" rid="ref7">(Larsen, Von Ins, 2010)</xref>
        to be between
2.2% and 9% for journals and between 1.6% and 14% for conferences (depending on
the discipline) in the decade 1997-2007. It is becoming increasingly difficult to
find the information needed to write scientific papers, review the work of other
researchers, or look for experts. This kind of search usually involves checking the
originality of an idea or a method. Current search engines dedicated to the exploration
of scientific literature, such as Google Scholar and Scopus, are based on text
search and on author and citation graphs. Recent work from the semantic web,
scientometrics and natural language processing communities has aimed to improve
access to scientific literature
        <xref ref-type="bibr" rid="ref10">(Osborne, Motta, 2015; Wolfram, 2016)</xref>
        , and some
initiatives have been started to gather researchers around this problem, such as the SAVE-SD
workshops and the ScienceIE task
        <xref ref-type="bibr" rid="ref1">(Augenstein et al., 2017)</xref>
        at SemEval-2017.
      </p>
      <p>
        In particular, the ScienceIE task focused on extracting keyphrases and the
relations between them, relying on the hypothesis that the ability to correctly
recognise these semantic items in text will help in tasks related to the process of scientific
publishing, such as recommending articles to readers, highlighting missing citations to
authors, identifying potential reviewers for submissions, and analysing research trends over
time. The hypothesis made by the organizers is that some concepts, notably
PROCESS, TASK and MATERIAL, are cardinal in scientific works, since they make it possible to
answer questions like: “which papers addressed a Task using variants of some
Process?”. In their vision, Processes correspond to methods and equipment, and Materials to
corpora and physical items. An example of text labelled with these concepts is shown
in Figure 1
        <xref ref-type="bibr" rid="ref1">(Augenstein et al., 2017)</xref>
        .
      </p>
      <p>In this paper, we propose a method to classify candidate terms into the three
categories defined in the ScienceIE challenge, using surface features combined with
WordNet-based features and word embeddings. This method outperforms the best
result obtained at ScienceIE. In the remainder of this paper, we describe the method
and the features used in Section 2, then we show the results obtained in Section 3, and
finally we draw some conclusions in Section 4.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Proposed Method</title>
      <p>
        The method we propose in this paper is based on Support Vector Machines (SVM),
in particular the nu-SVM implementation by
        <xref ref-type="bibr" rid="ref4">(Chang, Lin, 2011)</xref>
        . SVMs are well-known
maximum-margin classifiers; we chose them because of their robustness on
problems with a large number of features. Note that the method
described in this paper shares only part of its WordNet-based features with the
one we used to participate in the task
        <xref ref-type="bibr" rid="ref6">(Hernandez et al., 2017)</xref>
        .
      </p>
      <sec id="sec-3-1">
        <title>2.1. Base Features</title>
        <p>The base features consist of all the prefixes and suffixes of length 3, 4 or 5 of the
keyphrases that appeared in the training set with frequency greater than 10. For
instance, from the keyphrase “information extraction” we can identify the following
features: inf, info, infor as prefixes and ction, tion, ion as suffixes. Together with the
prefixes and suffixes, we considered the following features:
– capitalization of the keyphrase (binary);
– uppercase ratio, calculated as the number of uppercase characters divided by
the number of characters in the keyphrase;
– number of digits in the keyphrase;
– number of dashes;
– number of words.</p>
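        <p>The base features listed above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names (affixes, build_affix_vocabulary, base_features) and the feature keys are hypothetical, and two points the text leaves open are treated as assumptions, namely that “capitalization” means an uppercase first character and that the frequency threshold applies to each prefix or suffix.</p>
        <preformat><![CDATA[
```python
from collections import Counter

AFFIX_LENGTHS = (3, 4, 5)


def affixes(keyphrase):
    """Return (prefixes, suffixes) of length 3, 4 and 5 as two sets."""
    lengths = [n for n in AFFIX_LENGTHS if n <= len(keyphrase)]
    prefixes = {keyphrase[:n] for n in lengths}
    suffixes = {keyphrase[-n:] for n in lengths}
    return prefixes, suffixes


def build_affix_vocabulary(training_keyphrases, min_count=10):
    """Keep the affixes seen in more than min_count training keyphrases.

    Assumption: the threshold of 10 is applied to each affix separately.
    """
    counts = Counter()
    for keyphrase in training_keyphrases:
        prefixes, suffixes = affixes(keyphrase)
        counts.update(("prefix", p) for p in prefixes)
        counts.update(("suffix", s) for s in suffixes)
    return sorted(key for key, count in counts.items() if count > min_count)


def base_features(keyphrase, affix_vocabulary):
    """Map a keyphrase to a dictionary of affix and surface features."""
    prefixes, suffixes = affixes(keyphrase)
    features = {}
    for kind, affix in affix_vocabulary:
        present = affix in (prefixes if kind == "prefix" else suffixes)
        features[kind + "=" + affix] = int(present)
    n_chars = len(keyphrase)
    n_upper = sum(ch.isupper() for ch in keyphrase)
    # Assumption: "capitalization" means that the first character is uppercase.
    features["capitalized"] = int(keyphrase[:1].isupper())
    features["uppercase_ratio"] = n_upper / n_chars if n_chars else 0.0
    features["n_digits"] = sum(ch.isdigit() for ch in keyphrase)
    features["n_dashes"] = keyphrase.count("-")
    features["n_words"] = len(keyphrase.split())
    return features
```
]]></preformat>
        <p>The resulting dictionaries would still have to be converted into fixed-length numeric vectors before being passed to a classifier; that step is omitted here.</p>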
      </sec>
      <sec id="sec-3-2">
        <title>2.2. WordNet-based Features</title>
        <p>
          WordNet
          <xref ref-type="bibr" rid="ref9">(Miller, 1995)</xref>
          is a well-known lexical database for the English language.
In WordNet, word senses are represented as synsets, or “sets of synonyms”, which may
be connected to other synsets by some relationship. Some of the most common
relationships are meronymy (part-of) and hyperonymy (is-a). We define a synpath as
the list of synsets connecting a sense of a target word to the root of the WordNet
hierarchy, following the hyperonymy relation. In Figure 2 we show the synpaths
corresponding to the three senses of the word extraction in WordNet 3.0. The definitions
of the senses are as follows:
        </p>
        <p>1. extraction#1: the process of obtaining something from a mixture or compound
by chemical or physical or mechanical means;
2. extraction#2: properties attributable to your ancestry;
3. extraction#3: the action of taking out something (especially using effort or
force).</p>
        <p>From Figure 2 it can be observed that the synset process is in the synpath
(process, physical_entity) of extraction#1, which seems an important clue for classifying
this keyword as a PROCESS, according to the ScienceIE classification. We therefore
hypothesised that synpaths can be used effectively as features to predict the category
of a keyword. Given the number of synsets in WordNet (more than 117,000), we
opted to select only a subset of those synsets, limiting the scope to the
synsets that are particularly distinctive for each of the three classes. We calculated,
on the training corpus of ScienceIE, the probability <inline-formula><tex-math>p(s \mid C)</tex-math></inline-formula> of each synset
<italic>s</italic> with respect to class <italic>C</italic>. Subsequently, for each class <inline-formula><tex-math>C_i</tex-math></inline-formula>, we sorted the
synsets in decreasing order of the difference <inline-formula><tex-math>p(s \mid C_i) - \frac{p(s \mid C_j) + p(s \mid C_k)}{2}</tex-math></inline-formula>,
where <inline-formula><tex-math>C_j</tex-math></inline-formula> and <inline-formula><tex-math>C_k</tex-math></inline-formula> are the other two classes. We show in Table 1
the most distinctive synsets for each category. The semantic correlation between the
MATERIAL category and its distinctive synsets is particularly evident.</p>
        <p>We arbitrarily selected the top 20 distinctive synsets for each category and used
them to extract binary features<sup>5</sup>. A feature is true for a token if the corresponding synset is
present in any of the hypernym paths connecting the noun synsets of the token to the root synset.
Note that these features were added only for nouns, since there is no hierarchy for
the other lexical categories (if we exclude verbs, whose hierarchy is in any case very
shallow compared to that of nouns). If the keyphrase was composed of several terms,
we searched the synpaths for the rightmost noun in the keyphrase.</p>
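        <p>The selection of distinctive synsets can be illustrated with the following sketch, which ranks synsets by the difference between their probability in one class and their mean probability in the other classes. It is not the authors' code. The function names are hypothetical, and the way the probability of a synset given a class is estimated (the fraction of keyphrases of that class whose synpaths contain the synset) is an assumption, since the text does not specify the estimator. The synpaths are taken as given; in practice they would be obtained from WordNet.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def synset_probabilities(examples):
    """Estimate p(s|C) from (label, synsets) pairs.

    Assumption: p(s|C) is the fraction of class-C keyphrases whose
    synpaths contain synset s.
    """
    class_sizes = Counter()
    synset_counts = defaultdict(Counter)
    for label, synsets in examples:
        class_sizes[label] += 1
        synset_counts[label].update(set(synsets))
    return {
        label: {s: n / class_sizes[label] for s, n in synset_counts[label].items()}
        for label in class_sizes
    }


def most_distinctive(probabilities, target, top_k=20):
    """Rank synsets by p(s|target) minus the mean of p(s|C) over the other classes."""
    others = [label for label in probabilities if label != target]
    scores = {}
    for synset, p_target in probabilities[target].items():
        if others:
            p_others = sum(probabilities[o].get(synset, 0.0) for o in others) / len(others)
        else:
            p_others = 0.0
        scores[synset] = p_target - p_others
    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


def synset_features(synpath_synsets, selected_synsets):
    """Binary features: 1 if the selected synset occurs in any synpath of the token."""
    found = set(synpath_synsets)
    return {s: int(s in found) for s in selected_synsets}
```
]]></preformat>
        <p>With three classes, the mean over the other classes coincides with the difference used in the text; the default of 20 for top_k mirrors the number of synsets selected per category.</p>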
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Word Embeddings Features</title>
        <p>
          Word embeddings, as introduced by
          <xref ref-type="bibr" rid="ref2">(Bengio et al., 2006)</xref>
          , are vector
representations of words that capture a certain number of syntactic and semantic relationships,
generated with neural networks. In this work, we used the pre-trained vectors trained
on 100 billion words from a Google News dataset
          <xref ref-type="bibr" rid="ref8">(Mikolov et al., 2013)</xref>
          . The
vocabulary size is 3 million words and the vector length is 300. One of the problems we had
to solve in order to include embeddings was how to deal with keyphrases composed of more than
one term, since vectors are linked to single words (or, in some cases, to compound words or
terms).
          <xref ref-type="bibr" rid="ref5">(De Boom et al., 2016)</xref>
          showed that it is possible to exploit the properties of
embeddings to represent sentences with the average, the max, or the min of the vectors
of the composing words. We chose to use the max.
        </p>
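        <p>A minimal sketch of this aggregation is given below, assuming that the max, min and average are taken dimension by dimension over the word vectors of the keyphrase. The function names, the toy three-dimensional vectors and the zero-vector fallback for keyphrases with no known word are illustrative assumptions; the pre-trained vectors used in this work have 300 dimensions.</p>
        <preformat><![CDATA[
```python
def aggregate(vectors, mode="max"):
    """Combine word vectors of equal length into one vector, dimension by dimension."""
    if not vectors:
        raise ValueError("at least one word vector is required")
    if mode == "max":
        return [max(column) for column in zip(*vectors)]
    if mode == "min":
        return [min(column) for column in zip(*vectors)]
    if mode == "average":
        return [sum(column) / len(vectors) for column in zip(*vectors)]
    raise ValueError("unknown mode: " + mode)


def keyphrase_vector(keyphrase, embeddings, dimensions, mode="max"):
    """Aggregate the vectors of the words of a keyphrase that have an embedding."""
    vectors = [embeddings[w] for w in keyphrase.split() if w in embeddings]
    if not vectors:
        # Assumption: the text does not say how keyphrases without any
        # known word are handled; a zero vector is used here.
        return [0.0] * dimensions
    return aggregate(vectors, mode)


# Toy embeddings, for illustration only.
toy_embeddings = {
    "information": [0.2, -0.5, 0.1],
    "extraction": [-0.1, 0.4, 0.3],
}
vector = keyphrase_vector("information extraction", toy_embeddings, dimensions=3)
```
]]></preformat>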
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments and Results</title>
      <p>We carried out our experiments on the ScienceIE dataset<sup>6</sup>, consisting of a set of
450 articles collected from ScienceDirect, distributed among the domains Computer
Science, Material Sciences and Physics. The training set consists of 350 documents,
while the test set consists of 100 documents. The organizers also distributed 50
documents as a development set, but we did not use these data. The task consisted of three
sub-tasks:
– A) Mention-level keyphrase identification;
– B) Mention-level keyphrase classification;
– C) Mention-level semantic relation extraction between keyphrases with the
same keyphrase types. Relation types are HYPONYM-OF and SYNONYM-OF.</p>
      <p>
        In this paper we consider only sub-task B). For the evaluation, we refer to the
scenario in which the text is manually annotated and keyphrase boundaries
are given
        <xref ref-type="bibr" rid="ref1">(Augenstein et al., 2017)</xref>
        .
      </p>
      <p>5. Full list: https://github.com/snovd/corpus-data/blob/master/SemEval2017Task10/SynsetsRelatedToTrainingData.txt</p>
      <p>6. https://scienceie.github.io/resources.html</p>
      <p>In Table 2 we show the results obtained with different combinations of features,
compared to the best system at the SemEval 2017 ScienceIE (B subtask, with the
evaluation scenario where the keyphrase boundaries are given).</p>
      <sec id="sec-4-1">
        <title>PROCESS</title>
        <p>0.577, 0.728, 0.710, 0.701, 0.660</p>
      </sec>
      <sec id="sec-4-2">
        <title>MATERIAL</title>
        <p>0.726, 0.750, 0.778, 0.764, 0.760</p>
        <p>From these results and the confusion matrices in Figure 3 it can be seen that
WordNet features are very helpful in discriminating the MATERIAL from the PROCESS
class, while the word embeddings features had a positive impact on the TASK class,
which was the most difficult one.</p>
        <p>Figure 3. Confusion matrices: a) base features; b) base features + WordNet; c) base features + WordNet + embeddings; d) base features + embeddings.</p>
        <p>The confusion matrices also show that TASK is often confused with PROCESS,
which in turn seems to be predominant, indicating a bias in the collection
towards this class. An analysis of the annotated collection revealed certain
inconsistencies in the annotations that may be at the origin of the errors. For instance, in file
2212667814000732.ann, we found a conflicting annotation for “synthetic
assessment method”: on its own it is annotated as PROCESS, but the keyphrase “synthetic
assessment method based on cloud theory” is annotated as TASK, which seems odd.
In file S2212671612002351.ann, we found that “position estimation method” is
labelled as TASK, when it should instead be a PROCESS.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <p>We developed a method to classify keyphrases into a predefined set of categories
provided by the ScienceIE task at SemEval-2017. This method integrates external
knowledge, either acquired from an existing resource like WordNet or learned from a
large corpus of text and encoded using word embeddings, as features for an SVM
classifier. The results outperform those obtained by the best system presented at
SemEval-2017. Our method leaves room for improvement, since some
parameters were chosen arbitrarily and further investigation is needed to find the optimal
ones. We plan to exploit the domain of the document as an additional feature,
on the assumption that keyphrase styles may vary depending on the domain. The experiments
also highlighted some problems with the ScienceIE collection: on the one hand, one of the
classes seems underrepresented; on the other hand, our analysis exposed a certain number of
annotation errors, which may require a manual re-annotation.</p>
      <p>Acknowledgements</p>
      <p>This work has been partly supported by the program “Investissements d’Avenir”
overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex
EFL).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Augenstein</surname> <given-names>I.</given-names></string-name>,
          <string-name><surname>Das</surname> <given-names>M. K.</given-names></string-name>,
          <string-name><surname>Riedel</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Vikraman</surname> <given-names>L. N.</given-names></string-name>,
          <string-name><surname>McCallum</surname> <given-names>A.</given-names></string-name>
          (<year>2017</year>, August).
          <article-title>SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications</article-title>.
          In <source>Proceedings of the international workshop on semantic evaluation</source>.
          Vancouver, Canada, Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Bengio</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Schwenk</surname> <given-names>H.</given-names></string-name>,
          <string-name><surname>Senécal</surname> <given-names>J.-S.</given-names></string-name>,
          <string-name><surname>Morin</surname> <given-names>F.</given-names></string-name>,
          <string-name><surname>Gauvain</surname> <given-names>J.-L.</given-names></string-name>
          (<year>2006</year>).
          <article-title>Neural probabilistic language models</article-title>.
          In D. E. Holmes, L. C. Jain (Eds.),
          <source>Innovations in machine learning: Theory and applications</source>,
          pp. <fpage>137</fpage>-<lpage>186</lpage>.
          Berlin, Heidelberg, Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/3-540-33486-6_6
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Bjork</surname> <given-names>B.-C.</given-names></string-name>,
          <string-name><surname>Roos</surname> <given-names>A.</given-names></string-name>,
          <string-name><surname>Lauri</surname> <given-names>M.</given-names></string-name>
          (<year>2009</year>).
          <article-title>Scientific journal publishing: yearly volume and open access availability</article-title>.
          <source>Information Research: An International Electronic Journal</source>,
          Vol. <volume>14</volume>, No. <issue>1</issue>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Chang</surname> <given-names>C.-C.</given-names></string-name>,
          <string-name><surname>Lin</surname> <given-names>C.-J.</given-names></string-name>
          (<year>2011</year>).
          <article-title>LIBSVM: A library for support vector machines</article-title>.
          <source>ACM Transactions on Intelligent Systems and Technology</source>,
          Vol. <volume>2</volume>, pp. <fpage>27:1</fpage>-<lpage>27:27</lpage>.
          (Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>De Boom</surname> <given-names>C.</given-names></string-name>,
          <string-name><surname>Van Canneyt</surname> <given-names>S.</given-names></string-name>,
          <string-name><surname>Demeester</surname> <given-names>T.</given-names></string-name>,
          <string-name><surname>Dhoedt</surname> <given-names>B.</given-names></string-name>
          (<year>2016</year>, September).
          <article-title>Representation learning for very short texts using weighted word embedding aggregation</article-title>.
          <source>Pattern Recogn. Lett.</source>,
          Vol. <volume>80</volume>, No. C, pp. <fpage>150</fpage>-<lpage>156</lpage>.
          Retrieved from https://doi.org/10.1016/j.patrec.2016.06.012
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Hernandez</surname> <given-names>S. D.</given-names></string-name>,
          <string-name><surname>Buscaldi</surname> <given-names>D.</given-names></string-name>,
          <string-name><surname>Charnois</surname> <given-names>T.</given-names></string-name>
          (<year>2017</year>, August).
          <article-title>LIPN at SemEval-2017 Task 10: Filtering Candidate Keyphrases from Scientific Publications with Part-of-Speech Tag Sequences to Train a Sequence Labeling Model</article-title>.
          In <source>Proceedings of the international workshop on semantic evaluation</source>.
          Vancouver, Canada, Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Larsen</surname> <given-names>P. O.</given-names></string-name>,
          <string-name><surname>Von Ins</surname> <given-names>M.</given-names></string-name>
          (<year>2010</year>).
          <article-title>The rate of growth in scientific publication and the decline in coverage provided by science citation index</article-title>.
          <source>Scientometrics</source>,
          Vol. <volume>84</volume>, No. <issue>3</issue>, pp. <fpage>575</fpage>-<lpage>603</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Mikolov</surname> <given-names>T.</given-names></string-name>,
          <string-name><surname>Sutskever</surname> <given-names>I.</given-names></string-name>,
          <string-name><surname>Chen</surname> <given-names>K.</given-names></string-name>,
          <string-name><surname>Corrado</surname> <given-names>G. S.</given-names></string-name>,
          <string-name><surname>Dean</surname> <given-names>J.</given-names></string-name>
          (<year>2013</year>).
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>.
          In <source>Advances in neural information processing systems</source>,
          pp. <fpage>3111</fpage>-<lpage>3119</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Miller</surname> <given-names>G. A.</given-names></string-name>
          (<year>1995</year>).
          <article-title>WordNet: a lexical database for English</article-title>.
          <source>Communications of the ACM</source>,
          Vol. <volume>38</volume>, No. <issue>11</issue>, pp. <fpage>39</fpage>-<lpage>41</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Osborne</surname> <given-names>F.</given-names></string-name>,
          <string-name><surname>Motta</surname> <given-names>E.</given-names></string-name>
          (<year>2015</year>).
          <article-title>Klink-2: Integrating multiple web sources to generate semantic topic networks</article-title>.
          In <source>Proceedings of the 14th international conference on the semantic web - ISWC 2015 - volume 9366</source>,
          pp. <fpage>408</fpage>-<lpage>424</lpage>.
          New York, NY, USA, Springer-Verlag New York, Inc. Retrieved from http://dx.doi.org/10.1007/978-3-319-25007-6_24
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>