<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Science Graph for characterizing the recent scientific landscape</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takahiro Kawamura</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katsutaro Watanabe</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naoya Matsumoto</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shusaku Egami</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mari Jibu</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Maps of science representing the structure of science can help us understand science and technology development. However, navigating the recent scientific landscape is still challenging, since inter-citation and co-citation analysis is difficult to apply to ongoing projects and recently published papers. Therefore, in order to characterize what is being attempted in the current scientific landscape, this paper proposes a content-based method of locating research projects in a multi-dimensional space using word/paragraph embedding techniques. The proposed method successfully formed a science graph with 78% accuracy from 25,607 project descriptions of the 7th Framework Programme (FP7) from 2006 to 2016.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Research in scientometrics has developed techniques for analyzing research
activities and for measuring their relationships, and then constructed maps of science
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], one of the major topics in scientometrics. Maps of science have been useful
tools for understanding the structure of science, its spread, and the
interconnection of disciplines. However, conventional approaches to understanding research
activities focus on what authors tell us about past accomplishments through
inter-citation and co-citation analysis of published research papers. This
paper instead focuses on what researchers currently intend to work on in their
research projects. Project descriptions, however, do not have references and cannot be
analyzed using citation analysis; thus, we propose to analyze them with a
content-based method using natural language processing (NLP) techniques. We then
created a science graph, which is a knowledge graph representing recent
scientific trends, in which nodes represent research projects linked according to
the distance of their content similarity and their semantics.
Some studies have examined automatic topic classification based on content
using lexical approaches such as probabilistic latent semantic analysis (pLSA) and
latent Dirichlet allocation (LDA). One approach uses LDA to find the five most probable
words for a topic, and each document is viewed as a mixture of topics. This
approach can therefore classify documents across different agencies and publishers.
However, the relationship between any project and article, such as their
distance or semantics, cannot be computed directly.
      </p>
      <p>
        By contrast, Le and Mikolov [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a paragraph vector that learns
fixed-length feature representations from plain text, such as sentences, paragraphs,
and documents, using a two-layered neural network. A paragraph vector
is treated as another word in a paragraph and is shared across all contexts
generated from the same paragraph, but not across paragraphs. The paragraph
vectors are computed by fixing the word vectors and training the new paragraph
vector until convergence. By taking word order into account, paragraph vectors also
address a weakness of the bag-of-words models underlying LDA and pLSA.
      </p>
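The distributed-memory paragraph vector described above can be sketched in a few dozen lines. The toy trainer below is illustrative only, not the authors' implementation: it uses a tiny hand-made corpus, a full softmax output (the paper's model uses hierarchical softmax), and plain SGD. The paragraph vector P[di] is averaged with the context word vectors in every window of its own document, which is what makes it a shared "extra word" for that paragraph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each "paragraph" is a token list.
docs = [
    "neural networks learn representations".split(),
    "networks learn word representations".split(),
    "protein folding simulation".split(),
]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}
V, D, C = len(vocab), 16, 1              # vocab size, vector dim, window size

W = rng.normal(0.0, 0.1, (V, D))         # word vectors
P = rng.normal(0.0, 0.1, (len(docs), D)) # paragraph vectors, one per document
O = rng.normal(0.0, 0.1, (D, V))         # output (softmax) weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr, losses = 0.05, []
for epoch in range(100):
    loss = 0.0
    for di, doc in enumerate(docs):
        for t, w in enumerate(doc):
            ctx = [idx[doc[j]]
                   for j in range(max(0, t - C), min(len(doc), t + C + 1))
                   if j != t]
            # The paragraph vector acts as one more context "word",
            # shared across all windows of the same paragraph.
            h = (P[di] + W[ctx].sum(axis=0)) / (1 + len(ctx))
            y = softmax(h @ O)
            loss -= np.log(y[idx[w]])
            y[idx[w]] -= 1.0               # d(cross-entropy)/d(logits)
            g = (O @ y) / (1 + len(ctx))   # gradient w.r.t. each input vector
            O -= lr * np.outer(h, y)
            P[di] -= lr * g
            W[ctx] -= lr * g
    losses.append(loss)
```

In practice one would use a library such as gensim's Doc2Vec rather than hand-rolled SGD; the sketch only shows where the paragraph vector enters the prediction.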
    </sec>
    <sec id="sec-2">
      <title>Measurement of Project Relationships</title>
      <p>In this study, we analyzed project descriptions from FP7. Specifically, our
experimental data set consisted of the titles and descriptions of 25,607 FP7 projects
from 2006 to 2016, including 305,819 sentences in total. All words in the
sentences were tokenized and lemmatized before creating the vector space.</p>
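The paper does not name its tokenization and lemmatization tools, so the following is a minimal stand-in: a regex tokenizer plus a toy suffix-stripping "lemmatizer" (a real pipeline would use, e.g., a WordNet-based lemmatizer). The exception table and suffix list are illustrative assumptions.

```python
import re

# Toy irregular-form table; a real lemmatizer covers far more.
EXCEPTIONS = {"is": "be", "are": "be", "has": "have"}

def tokenize(text):
    # Lowercase and split on any run of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def lemmatize(token):
    # Crude suffix stripping as a stand-in for true lemmatization.
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

description = "The project develops graph embeddings for science."
tokens = [lemmatize(t) for t in tokenize(description)]
```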
      <p>Firstly, we constructed paragraph vectors for the 25,607 FP7 projects using the
current paragraph embedding technique. The hyperparameters were set
empirically as follows: 500 dimensions were established for the 66,830 words that appeared
more than five times; the window size c was 10; and the learning rate and
minimum learning rate were 0.025 and 0.0001, respectively, with an adaptive gradient
algorithm. The learning model is a distributed memory model with hierarchical
softmax. As a result, we found that projects are scattered and not clustered by
any subject or discipline in the vector space. Most projects are weakly connected
to a small number of other projects. Thus, it is difficult to grasp trends or to compare
the result with an ordinary classification system such as SIC codes. Closely observing the
vector space reveals some reasons for this unclustered problem: words
with nearly the same meaning have slightly different word vectors, and shared
but unimportant words are taken as the commonality of paragraphs.
Therefore, to address this problem, we introduce an entropy-based method for
clustering word vectors before constructing paragraph vectors.</p>
      <p>To unify word vectors of almost the same meaning, excluding trivial common
words, we generated cluster vectors of word vectors based on the entropy of each
concept in a thesaurus. We calculated the information entropy of each concept
in the FP7 projects. Next, after creating clusters according to the degree of
entropy, we unified all word vectors in the same cluster into a cluster vector and
constructed paragraph vectors based on the cluster vectors. The overall flow is
shown in Fig. 1.</p>
      <p>H(C) = − ∑_{i=0..n} ( ∑_{j=0..m} p(Sij|C) ) log2 ( ∑_{j=0..m} p(Sij|C) )   (1)</p>
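Eq. (1) is the Shannon entropy of a concept, where the probability of each term Ti is the summed probability of its synonyms. A direct sketch (the nested-list input format is an assumption for illustration):

```python
from math import log2

def concept_entropy(term_synonym_counts):
    """Entropy H(C) of a concept from corpus frequencies.

    term_synonym_counts: one list per term Ti (hypernym T0 and hyponyms
    T1..Tn), each containing the counts of that term's synonyms Si0..Sim
    in the data set.
    """
    total = sum(c for term in term_synonym_counts for c in term)
    if total == 0:
        return 0.0
    h = 0.0
    for term in term_synonym_counts:
        p = sum(term) / total      # summed synonym probability for term Ti
        if p > 0:
            h -= p * log2(p)       # Shannon entropy term
    return h

# Equal term frequencies -> maximal diversity; skewed -> low informativeness.
balanced = concept_entropy([[10], [10], [10], [10]])
skewed = concept_entropy([[37], [1], [1], [1]])
```

As the prose notes, `balanced` (four equally likely terms, H = 2 bits) exceeds `skewed`, where one term dominates.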
      <p>
        Shannon's entropy in information theory is an estimate of event
informativeness. Given that a thesaurus consists of terms Ti, we calculated the entropy of a
concept C by considering the appearance frequencies of a hypernym T0 and its
hyponyms T1, ..., Tn as event probabilities. The frequencies of the synonyms Si0, ..., Sim
of a term Ti were summed into the corresponding concept (the synonyms Sij include
the descriptors of the terms Ti themselves). In Eq. (1), p(Sij|C) is the probability of a
synonym Sij given a concept C and terms Ti. For each concept in the thesaurus,
we calculated the entropy H(C) in the FP7 data set. As the probabilities of
events become equal, H(C) increases. If only particular events occur, H(C) is
reduced because of low informativeness. Thus, the proposed entropy of a concept
increases when a hypernym and hyponyms that construct a concept separately
appear with a certain frequency in the data set. Therefore, the degree of entropy
indicates the semantic diversity of a concept. Then, assuming that the degree of
entropy and the spatial size of a concept in a word vector space are proportional
to a certain extent, we split the word vector space into clusters. In fact, our
preliminary experiment indicated that the entropy of a concept has high correlation
R = 0.602 with the maximum Euclidean distance of hyponyms in the concept
in a vector space, at least while the entropy is rather high. The vector space is
subdivided into clusters proportionally to the ratio of the highest two concept
entropies. Each cluster is subdivided until the entropy becomes lower than 0.25
(the top 1.5% of entropies) or the number of elements in a cluster is lower than
10. These parameters were also determined empirically through the experiments.
After generating 1,260 clusters from 66,830 word vectors, we considered
the centroid of all vectors in a cluster as a cluster vector. Then, we obtained
paragraph vectors by calculating the maximum likelihood of L in Eq. (2), which
is an extension of the paragraph embedding defined in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Cl(w) means a
cluster vector to which a word w belongs, and di is a vector for a paragraph i that
includes wt. T is the number of words with a certain usage frequency in the
corpus. Using high-entropy concepts in scientific and technological contexts as
common points with each paragraph vector (excluding trivial words), paragraph
vectors can comprise meaningful groups in the vector space.
      </p>
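The cluster-vector step above (centroid of all word vectors in a cluster, then Cl(w) lookup) can be sketched as follows. The word vectors and cluster assignments here are hypothetical two-dimensional examples, not the paper's data:

```python
import numpy as np

# Hypothetical word vectors; "neuron" and "neural" are near-synonyms.
word_vectors = {
    "neuron":  np.array([1.0, 0.1]),
    "neural":  np.array([0.9, 0.2]),
    "protein": np.array([0.0, 1.0]),
}
# Hypothetical entropy-based cluster assignment (word -> cluster id).
assignment = {"neuron": 0, "neural": 0, "protein": 1}

# A cluster vector is the centroid of all word vectors in the cluster.
n_clusters = max(assignment.values()) + 1
cluster_vectors = []
for c in range(n_clusters):
    members = [word_vectors[w] for w, a in assignment.items() if a == c]
    cluster_vectors.append(np.mean(members, axis=0))

def Cl(word):
    # Cl(w) of Eq. (2): the cluster vector the word w belongs to.
    return cluster_vectors[assignment[word]]
```

After this substitution, near-synonyms such as "neuron" and "neural" contribute the identical vector Cl(w) to every paragraph window, which is what lets paragraph vectors form meaningful groups.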
      <p>L = ∑_{t=1}^{T} log p(Cl(wt) | Cl(wt−c), ..., Cl(wt+c), di)   (2)</p>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Evaluation</title>
      <p>The science graph for FP7 is publicly accessible at http://togodb.jst.go.jp/
sample-apps/map_FP7/ (click the "Drawing Map" button; see the CORDIS
website for the subject codes). The distributed recursive graph layout (DrL) algorithm,
which produces edge-weighted force-directed graphs, was used to visualize the
relationships between projects. We computed 328 million cosine similarities for
all pairs of the 25,607 projects; however, we kept as edges only those above a
given threshold (0.35 in this case), due to visualization limitations.</p>
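The pairwise cosine similarity and edge thresholding can be sketched as below. Random vectors stand in for the 25,607 paragraph vectors; at the paper's scale one would compute the similarity matrix in blocks rather than all at once:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(50, 8))   # stand-in paragraph vectors

# Normalize rows so a plain dot product equals cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Keep only pairs above the visualization threshold (0.35 in the paper).
threshold = 0.35
i, j = np.triu_indices(len(vectors), k=1)   # each unordered pair once
edges = [(a, b, sim[a, b]) for a, b in zip(i, j) if sim[a, b] > threshold]
```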
      <p>In terms of the unclustered problem, we confirmed that the proposed method
successfully formed several clusters compared with the baseline method, by
comparing the relationship between cosine similarity and the number of
edges, and the relationship between degree centrality and the number of nodes.</p>
      <p>However, since there is no gold standard for evaluating the distances among
research projects, we evaluated the accuracy of the similarities with a sampling
method. We randomly extracted 100 pairs of projects with a cosine similarity
of &gt; 0.5, so that the sample distribution resembled the entire distribution. Each pair
has two project titles and descriptions, and a cosine value that is divided into
three levels: weak (0.5 ≤ cos &lt; 0.67), middle (0.67 ≤ cos &lt; 0.84), and strong
(cos ≥ 0.84). Then, three members of our organization, a funding agency in
Japan, evaluated the similarity of each pair. The members were given prior
explanations of the intended use of the graph, some examples of
evaluation, and the same evaluation data. As a result, we confirmed that 78% of the
project similarities (i.e., the distances in the graph) matched the majority vote of
the members' opinions. Misjudged examples include two projects that use
many homonyms and thus have high cosine values, and two projects that happen to share
some similar sentences, with cosine values just over the threshold. By
contrast, the accuracy of the distances in the baseline was 21%. The evaluation
results were determined to be in "fair" agreement (Fleiss' Kappa = 0.29).</p>
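The reported inter-rater agreement is the standard Fleiss' kappa statistic; a self-contained sketch (not the authors' script) is below. It takes, per item, the list of category labels the raters assigned:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings: per-item lists of category labels,
    with the same number of raters for every item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({c for item in ratings for c in item})
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement P_i.
        p_bar += (sum(v * v for v in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    # Chance agreement from overall category proportions.
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)

perfect = fleiss_kappa([["yes"] * 3, ["no"] * 3])
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and the paper's 0.29 falls in the conventional "fair" band (0.21 to 0.40).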
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>
        Since funding projects have no references and recently published articles
do not yet have enough citations, we assessed the relationships using a
content-based method instead of citation analysis. At the back end of the graph,
bibliographic information is stored as our Linked Data database [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and the entries are
mainly connected by similarTo with similarity values and by hasConcept with
common concept classes.
      </p>
      <p>As the next step, we will extract new insights from the science graph of
research projects, especially in comparison with previous maps of science based
on citation analysis of published papers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Boyack</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klavans</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Borner</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>"Mapping the backbone of science,"</article-title>
          <source>Scientometrics</source>
          ,
          <volume>64</volume>
          (
          <issue>3</issue>
          ), pp. 351–374,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>"Distributed Representations of Sentences and Documents,"</article-title>
          <source>Proc. of ICML</source>
          <year>2014</year>
          ,
          <volume>32</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kimura</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawamura</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watanabe</surname>
          </string-name>
          , et al.:
          <article-title>"J-GLOBAL knowledge: Japan's Largest Linked Open Data for Science and Technology,"</article-title>
          <source>Proc. of ISWC</source>
          <year>2015</year>
          ,
          Poster &amp; Demo Track
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>