<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Masaki Eto</string-name>
          <email>masaki.eto@gakushuin.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gakushuin Women's College Tokyo</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>30</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>To improve the search performance of retrieval methods using cocitation linkages, this study proposes a technique to enlarge a co-citation network by incorporating satellite documents. This technique specifies satellite documents via full-text searches for terms obtained from documents having cocitation linkages with a seed document; the appropriateness of each co-citation linkage is checked by using the strength of the co-citation context based on the results of parsing documents that cite the seed document. This study evaluates search performance using the proposed technique with IR experiments. Specifically, the random walk with restart algorithm, which can compute similarities between the seed document and each document in the network, is applied to the enlarged and initial networks. Scores of the normalized discounted cumulative gain (nDCG@K) were then compared. The results indicate that the search performance of the retrieval methods using the enlarged network outperforms those of a baseline method using the initial network.</p>
      </abstract>
      <kwd-group>
        <kwd>Co-citation</kwd>
        <kwd>Context</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Random walk with restart</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In the field of scientific paper searches, citations are often used to measure implicit
relationships between documents. One approach to improve the search performance
of retrieval methods using citation linkages is to enlarge the citation networks by
incorporating additional information. In the case of a network created using direct
citation linkages, i.e., the linkages between the citing and cited documents, techniques to
enlarge the network of citations on the basis of additional information, such as citing
text [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or user profiles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], have been reported.
      </p>
      <p>
        This study enlarges the networks connected by co-citations. A co-citation is
defined as a linkage between a pair of documents concurrently cited by a third
document. In the simplest retrieval method using co-citation, documents having a
cocitation relationship with a given seed document that are known to be relevant are
presented to the user under the assumption that documents co-cited with such a seed
document tend to be topically similar to the seed document. Co-citation networks
have been used in bibliometrics and can also be applied to scientific paper searches
(e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>This study proposes a technique to enlarge the co-citation network by adding
word-based linkages. When documents are detected by the co-citation linkage, it is
possible to obtain more appropriate search terms from the document; such terms may
not have been included in the original seed document. A set of new search terms may
yield additional relevant documents that were not identified simply by the co-citation
linkages or the user’s original representation of his or her information needs. This
study defines satellite documents as documents that are specified via full-text searches
for new search terms. The purpose of the proposed technique is to incorporate these
satellite documents into the initial network of documents, which is already connected
by co-citation linkages.</p>
      <p>
        In addition, the proposed technique attempts to reduce noise satellite
documents incorporated into the initial co-citation network using the co-citation context.
Some studies (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) have reported that using the contexts of co-citations
has positive effects for reducing noise documents when co-citation networks are
enlarged by additional co-citation linkages; therefore, it is feasible to use co-citation
contexts when enlarging co-citation networks by adding word-based linkages.
      </p>
      <p>
        This study empirically evaluates the search performance of retrieval methods
using the proposed technique with IR experiments. Specifically, the random walk
with restart (RWR) algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which can compute similarities between the seed
document and each document in the network, is applied to enlarged networks and
initial networks, and the results are compared by computing scores of the cutoff
version of the normalized discounted cumulative gain (nDCG@K).
2
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED TECHNIQUE</title>
      <sec id="sec-2-1">
        <title>Specifying satellite documents</title>
        <p>Figure 1 shows an initial network comprising document nodes connected by
undirected co-citation linkages. In this network, a search query is a seed document that is
known to be relevant to the information needs of a user. The weight of the edge, w,
i.e., the strength of the co-citation linkage, is computed as
w ( 1,  2) = cociting ( 1,  2). (1)</p>
        <p>Here, d1 and d2 are co-cited documents and cociting(d1, d2) denotes the total
number of documents co-citing d1 and d2 in the target document set. Note that this
study denotes a weighted edge between d1 and d2 as Edge (d1, d2, w).</p>
        <p>Satellite
documents of C1</p>
        <p>A</p>
        <p>Seed
Satellite
documents of C3
new T1 T2 T3 existing C3 E1 E2</p>
        <p>The proposed technique specifies satellite documents by investigating
documents one hop from the seed. This study defines host documents as source documents
that are used to specify satellite documents. Using the title words of the host
document as a query, the satellite documents are specified on the basis of a standard
fulltext search method; the seed document is excluded from the search target. For
example, in Figure 1, Document C1, a host document that is one hop from the seed,
specifies six satellite documents. In the experiments in this study, the tf-idf retrieval
function of the Indri search engine, which has been developed as part of the Lemur
Project, was used. The top N documents ranked by this full-text search were adopted as
satellite documents (e.g., N = 10).</p>
        <p>
          In addition, as an optional process, the proposed technique attempts to check
the appropriateness of each host document as a source because inappropriate host
documents may yield noise. To check appropriateness, this technique uses the
strength of co-citation context (see e.g., [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) identified by parsing the full-text
of documents that cite the seed and each host. More specifically, this technique
examines reference positions within the text and if references to both the document and the
seed appear within a paragraph in one or more citing documents, the document is
selected as a host document because a seed and host co-cited in a strong context are
expected to be closely related. For example, in Figure 1, if one or more documents
cite Documents A and C3 in the same paragraph, Document C3 would be selected as a
host document. Conversely, if no documents cite them in the same paragraph,
Document C3 would not be selected as a host document.
2.2
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Incorporating satellite documents</title>
        <p>If a satellite document is new, a new node is created with an undirected edge of
weight 1 connecting the new node to its host. When two host documents share a new
satellite document, one new node and two edges between the new node and each host
node are created. In Figure 1, Document T3 is specified by host Documents C1 and C3;
therefore, a new node T3, Edge (T3, C1, 1), and Edge (T3, C3, 1) are created. In
addition, if a new document has co-citation linkages with documents already existing in
the initial network or with other new documents, new edges are created and weights
are assigned using Eq. (1).</p>
        <p>When a satellite document already exists in a given network, the linkage
between the satellite document and its host is used to create a new edge or recalculate
the weight of a given edge. If the linkage is new for the network, an undirected edge
of weight 1 is created between the satellite and its host. If the linkage already exists in
the initial network, the weight of the existing edge is recalculated as
w ( 1,  2) = cociting ( 1,  2) + 1. (2)</p>
        <p>Some new linkages may be duplicated in the specified results. In such cases,
the proposed technique treats them as one combined link and creates one new edge.
For example, in Figure 1, Document C1 has satellite document C3 and vice versa;
therefore, only Edge (C1, C3, 1) is incorporated into the network.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Ranking the documents in the network</title>
        <p>
          To calculate document scores, the RWR algorithm is applied to the enlarged network.
This algorithm iteratively investigates the entire network, and the similarity between a
seed node and each node in the network is calculated (see, e.g., [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]).
Specifically, the walker starts at a seed node and then either proceeds to the connected nodes
on the basis of a probability calculated by weights or returns back to the seed node;
these steps are repeated iteratively until convergence. The long-term visit rate of each
node is used as a document score; these rates are given by the steady state of
 ⃗ = (1 −  ) ̃ ⃗ +   ⃗. (3)
        </p>
        <p>Here,⃗ is an n-dimensional vector (with n being the number of nodes in the
network),⃗ is an n-dimensional vector with 1 for the seed node and 0 for the others,
and r is a return probability. This study uses the following 11 values of r in the
experiments: 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 0.99. Alsõ, is a transition
probability matrix, and each transition probability between two nodes is the weight of
an edge, which is normalized by the summation of the weights of the edges connected
to the current node. In the case shown in Figure 1, the probability “A to C1” is 0.286
given as 2/(2+3+2). Thereforẽ, is an asymmetric matrix, i.e., one direction can be
different from another direction, e.g., the probability “C1 to A” is not equal to 0.286.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTAL SETUP</title>
      <p>As described in Section 2.1, the proposed technique has an optional process.
Therefore, this study evaluates the search performance of two retrieval methods. First,
Proposed (all) omits the optional process and simply identifies all documents one hop
from the seed as host documents. Second, Proposed (context) selects host documents
using the strength of the co-citation context. For both retrieval methods, the parameter
N (i.e., the number of retrieved documents per host document) was set to 10 and 100.
In addition, the study evaluates the search performance of a baseline method that
applies the RWR algorithm only to the initial co-citation network. In this experiment,
the three retrieval methods take up to two hops from the seed to create each initial
cocitation network; three or more hops are out of scope.</p>
      <p>To create a special test collection, the Open Access Subset of PubMed Central
was used. The test collection was constructed by selecting approximately 152,000
documents from the subset with the condition that the document had at least one
citation linkage with a document in the subset. The test collection contained 100 seed
documents that were randomly selected from all the documents under the condition
that each seed document had co-citation linkages with 10 or more documents.</p>
      <p>In addition, this experiment adopted nDCG@K as a metric to evaluate the
search performance (with K = 5, 10, 50, and 100). A document was considered
relevant depending on the degree to which it shared MeSH Descriptors with the target
seed document. More specifically, the Jaccard coefficient (JC) was used, i.e., when
nDCG was calculated, the experiment used a relevance score of 3 for documents
whose JC was 0.3 or more, 2 for documents whose JC was 0.2–0.3, and 1 for
documents whose JC was 0.1–0.2.
4</p>
    </sec>
    <sec id="sec-4">
      <title>RESULTS</title>
      <p>Search runs for 100 seed documents were executed using each method.
4.1</p>
      <sec id="sec-4-1">
        <title>Evaluation of incorporated documents</title>
        <p>First, the experiment examined whether the newly incorporated documents were
relevant (see Figure 1). Table 1 shows the average number of relevant incorporated
documents; a document is relevant if the JC is 0.1 or more. Further, Table 1 lists the
average ratio of the relevant documents, which is the total number of relevant
documents over 100 search runs divided by the total number of new documents over the
100 search runs.</p>
        <p>As shown in the table, the numbers of relevant documents were relatively
large. For example, Proposed (all) incorporated more than 50 new relevant
documents per seed. Therefore, the proposed technique has the potential to improve the
search performance.</p>
        <p>Further, the ratio of relevant documents for Proposed (context) was higher
than that of Proposed (all). This result indicates that the checking process using the
co-citation context tends to exclude inappropriate host documents.</p>
        <p>Proposed N = 10
K Baseline (r ) all (r ) context (r )
5 0.226 (0.9) 0.226 (0.99) 0.232* (0.99)
10 0.223 (0.99) 0.221 (0.99) 0.227** (0.99)
50 0.188 (0.99) 0.191* (0.99) 0.189** (0.99)
100 0.174 (0.99) 0.181** (0.99) 0.177* (0.99)</p>
        <p>Propsed N = 100
all (r ) context (r )
0.224 (0.9) 0.234** (0.9)
0.226 (0.99) 0.230** (0.99)
0.197** (0.99) 0.191 (0.99)
0.188** (0.99) 0.180** (0.99)</p>
        <p>* P &lt; 0.05, ** P &lt; 0.01</p>
        <p>In Table 2, the maximum scores of the five retrieval results at each K are
shown in bold. These are the results of Proposed (context) and Proposed (all) with N
= 100, with the paired t-tests showing statistically significant differences. Therefore,
the retrieval methods using the proposed technique tended to outperform the baseline
method.</p>
        <p>Furthermore, the scores of Proposed (context) were higher than those of the
baseline method in all cases, with the paired t-tests indicating a statistically significant
difference in most cases. Conversely, some scores of Proposed (all), i.e., with N = 10
at K = 10 and with N = 100 at K = 5, were lower than those of the baseline method.
This suggests that the checking process had a stable and positive impact on improving
the search performance.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>This study proposed a technique to enlarge co-citation networks by incorporating
satellite documents in scientific paper searches. Retrieval methods using the proposed
technique tended to outperform the baseline method, which was based on the initial
co-citation network.
6
7</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS REFERENCES</title>
      <p>This work was supported by JSPS KAKENHI Grant Number JP26730163.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kifer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Giles</surname>
          </string-name>
          . C. L.
          <article-title>Context-aware citation recommendation</article-title>
          .
          <source>In Proceedings of the 19th International World Wide Web Conference (WWW2010)</source>
          ,
          <fpage>421</fpage>
          -
          <lpage>430</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Exploiting Potential Citation Papers in Scholarly Paper Recommendation</article-title>
          ,
          <source>In Proceedings of the 13th ACM/IEEE Joint Conference on Digital Libraries (JCDL</source>
          <year>2013</year>
          ),
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Eto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Document retrieval method using random walk with restart on weighted cocitation network</article-title>
          ,
          <source>In Proceedings of the 77th ASIS&amp;T Annual Meeting</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Eto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Spread co-citation relationship as a measure for document retrieval</article-title>
          .
          <source>Proceedings of the fifth ACM workshop on Research advances in large digital book repositories and complementary media</source>
          , 7-
          <fpage>8</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faloutsos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Random walk with restart: fast solutions and applications</article-title>
          .
          <source>Knowledge and Information Systems</source>
          ,
          <volume>14</volume>
          ,
          <issue>3</issue>
          ,
          <fpage>327</fpage>
          -
          <lpage>346</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gipp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Citation proximity analysis (CPA) - A new approach for identifying related work based on co-citation analysis</article-title>
          .
          <source>In Proceedings of the 12th ISSI Conference</source>
          .
          <volume>2</volume>
          ,
          <fpage>571</fpage>
          -
          <lpage>575</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Eto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Evaluations of context-based co-citation searching</article-title>
          ,
          <source>Scientometrics</source>
          <volume>94</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>651</fpage>
          -
          <lpage>673</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pucci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Research paper recommender systems: A random-walk based approach</article-title>
          .
          <source>In Proceedings of IEEE/WIC/ACM Web Intelligence</source>
          ,
          <fpage>778</fpage>
          -
          <lpage>781</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>