<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>K-means and Hierarchical Clustering Method to Improve our Understanding of Citation Contexts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marc Bertin</string-name>
          <email>bertin.marc@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iana Atanassova</string-name>
          <email>iana.atanassova@univ-fcomte.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CRIT-Centre Tesnière, University of Bourgogne Franche-Comté</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Centre Interuniversitaire de Recherche sur la Science et la Technologie (CIRST), Université du Québec à Montréal (UQAM)</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we focus on the clustering of citation contexts in scientific papers. We use two methods, k-means and hierarchical clustering, to better understand the phenomenon and types of citations and to explore the multidimensional nature of the elements composing the contexts of citations in different sections of the papers. We have analyzed a data set of seven peer-reviewed academic journals published by PLOS. The obtained clusters show that the Methods section is specific in nature, regardless of the journal. A proximity between some of the journals can also be observed.</p>
      </abstract>
      <kwd-group>
        <kwd>In-text References</kwd>
        <kwd>Bibliometrics</kwd>
        <kwd>Citation Analysis</kwd>
        <kwd>IMRaD Structure</kwd>
        <kwd>Text Mining</kwd>
        <kwd>K-means</kwd>
        <kwd>hierarchical clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A great deal of current research addresses citation contexts in scientific
papers. Although these themes are not recent, there is a renewed interest in this
field with the adoption of different techniques that come from text mining.
The main challenge of these studies is to propose a method to analyze citation
contexts on a large scale while taking various criteria into account.</p>
      <p>
        We propose a multidimensional approach to this problem which is based on
clusters. Clustering algorithms allow us to group similar contexts, which are then
considered members of the same cluster. This study provides new results on citation
contexts and is related to two previous studies on similar problems [
        <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
        ]. This
type of approach and techniques have direct applications for the processing of
texts from the social sciences (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
We know from previous studies that the rhetorical structure of papers must
be taken into account as it plays an important role in determining the types
of citation contexts (see e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Furthermore, the specific domains and topics
of the various journals, which also have their own editorial lines, can lead to
variations and have an effect on the direct context of citations. For this reason, we
use a text mining approach to obtain sets and subsets of contexts, in order to
determine whether different classes of contexts exist, to produce a typology and
to better understand this issue.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>To perform this study we have analyzed a data set of seven peer-reviewed
academic journals published in Open Access by the Public Library of Science
(PLOS). Six of the journals are domain-specific (PLOS Biology, PLOS
Computational Biology, PLOS Genetics, PLOS Medicine, PLOS Neglected Tropical
Diseases and PLOS Pathogens) and the seventh is PLOS ONE, which is a general
journal that covers all fields of science and social sciences. We have used for
our experiments the entire data set of about 80,000 research articles in full text
published up to September 2013.</p>
      <p>
        The data set is in the XML JATS format, where the sections and paragraphs
are identified as distinct XML elements, and the in-text references are
linked to the corresponding elements in the bibliography of the article. Various
aspects of the processing of this corpus and the distributions of in-text references
and their contexts with respect to the IMRaD structure have been the object of
previous studies [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Protocol</title>
      <p>We have considered the articles in the corpus that are organised following the
IMRaD structure (Introduction, Methods, Results and Discussion). As this is
part of the editorial requirements of the journals, the vast majority of the papers
share this structure. For each journal and for each of the four section types, we
have extracted a random sample of 1000 sentences that contain in-text citations.
These sentences will be considered as citation contexts in our experiment.</p>
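      <p>The sampling step can be sketched as follows. This is only an illustration in Python, which is our choice for the example and not necessarily the tooling used in the study; the input layout (journal, section, text), the function name sample_contexts and the fixed seed are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def sample_contexts(sentences, per_group=1000, seed=0):
    """Draw up to `per_group` sentences for every (journal, section) pair.

    `sentences` is an iterable of (journal, section, text) tuples; this
    layout is a hypothetical one chosen for the example.
    """
    groups = defaultdict(list)
    for journal, section, text in sentences:
        groups[(journal, section)].append(text)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        key: rng.sample(texts, min(per_group, len(texts)))
        for key, texts in groups.items()
    }


# Toy input: two journals, two section types, five sentences each.
toy = [
    (journal, section, f"{journal}-{section}-{i}")
    for journal in ("pbio", "pmed")
    for section in "im"
    for i in range(5)
]
samples = sample_contexts(toy, per_group=3)
```
]]></preformat>
      <p>With seven journals and four section types, such a procedure would yield 28 samples of 1000 sentences each.</p>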
      <p>The pre-processing of the corpus consists in removing all punctuation marks
and numerical values, so that the bibliographic reference markers are not
overrepresented and do not bias the analysis. We also removed stopwords using a
stopword list, and the remaining terms were stemmed.</p>
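      <p>A minimal sketch of such a pre-processing chain is given below. The stopword list and the suffix-stripping rule are placeholders that we introduce for illustration only: the actual stopword list and stemmer used in the study are not specified here, and a real stemmer (for example a Porter-style one) would behave differently.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Tiny illustrative stopword list; the list used in the study is not given.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "was", "as", "by",
             "this", "that", "been", "has"}

# Crude suffix stripper standing in for a real stemmer (an assumption).
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def crude_stem(token):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Lowercase, drop punctuation and numbers, remove stopwords, stem."""
    text = re.sub(r"[^a-z\s]", " ", sentence.lower())  # keep letters only
    return [crude_stem(t) for t in text.split() if t not in STOPWORDS]


tokens = preprocess("This method was described previously [12, 13].")
```
]]></preformat>
      <p>On the toy sentence, the bracketed reference numbers disappear and the remaining tokens are method, describ and previously.</p>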
    </sec>
    <sec id="sec-4">
      <title>Hierarchical Clustering and K-means</title>
      <p>
        We analyse citation contexts using a multidimensional approach. We used two
complementary approaches, hierarchical clustering and K-means, that allow us
to better understand the phenomenon and types of citations and to explore the
multidimensional nature of the elements composing the contexts of citations. To
obtain the clusters we use the K-means method (see [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]). From an
application point of view, this work is based on the use of an R library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
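      <p>To make the principle of K-means concrete, a small self-contained sketch follows. It implements the simple Lloyd-style iteration (assign each point to the nearest centroid, then recompute the centroids), which is not the Hartigan and Wong variant cited above and not the R implementation used for the experiments. The toy points and the function names are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no change means convergence
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels, centroids


# Toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, k=2)
```
]]></preformat>
      <p>In the experiments, each point would correspond to the term profile of one journal and section type rather than to a point in the plane.</p>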
      <p>From a methodological point of view, we have used a popular partitioning
method: K-means clustering. However, this approach requires the number of
clusters to be known in advance. To determine this number, we have used a graph that shows
the relation between the number of extracted clusters and their distance. This
method allows us to obtain the value of k and to generate the clusters using the
K-means method. The correct choice of k can be ambiguous. The difficulty arises
from the fact that we have to find a balance between the shape and scale of the
data set and the number of clusters that the user wants to obtain. We propose
in this paper two approaches for diagnosing the number of clusters suitable
for the data (see figures 1 and 2). The result in figure 1 uses the elbow method with the
sum of squared errors, and figure 2 uses the Calinski criterion with the number of
groups ranging from one to ten. The elbow method is a heuristic for interpreting
and validating the coherence of clusters in order to find the appropriate number of clusters
in a data set.</p>
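      <p>The two diagnostics can be illustrated on a toy example. The sketch below computes the within-cluster sum of squared errors used by the elbow method and the Calinski-Harabasz index, defined as (B / (k - 1)) / (W / (n - k)), where B and W are the between-cluster and within-cluster sums of squares, n the number of points and k the number of clusters. The data points and the hand-written partitions are assumptions made for the example; in practice the partitions would come from K-means runs with increasing k.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def within_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        center = centroid(members)
        total += sum(squared_distance(p, center) for p in members)
    return total


def calinski_harabasz(points, labels):
    """(B / (k - 1)) / (W / (n - k)); needs 2 <= k < n and W > 0."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(
        len(members) * squared_distance(centroid(members), overall)
        for members in clusters.values()
    )
    return (between / (k - 1)) / (within_ss(points, labels) / (n - k))


# Four toy points and three hand-written partitions (k = 1, 2, 3).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
partitions = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}

sse = {k: within_ss(points, labels) for k, labels in partitions.items()}
ch = {k: calinski_harabasz(points, labels)
      for k, labels in partitions.items() if k > 1}
```
]]></preformat>
      <p>On these toy data the sum of squared errors falls from 104 (k = 1) to 4 (k = 2) and then only to 2 (k = 3), so the elbow is at k = 2; the Calinski-Harabasz index agrees, with 50 for k = 2 against 25.5 for k = 3. The index is not defined for k = 1.</p>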
      <sec id="sec-4-1">
        <title>Results</title>
        <p>
          The results obtained from the K-means clustering and the hierarchical clustering
are presented in figures 3 and 4, respectively. Figure 3 shows an analysis
of the data set with an arbitrary choice of the number of clusters k = 4. The
names of the journals are coded using four characters, followed by one character (i,
m, r or d) that corresponds to the section type. This figure shows the specific
character of the Methods section (on the left), confirming earlier work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] that
underlines the atypical nature of this section in terms of citation contexts. Indeed,
the Methods sections are all located in a single cluster,
regardless of the journal. The journal ppat (PLOS Pathogens) is also in a
cluster of its own that contains the different sections of the rhetorical
structure. In addition, pntd (PLOS Neglected Tropical Diseases) and pmed
(PLOS Medicine) are both located in one and the same cluster.
        </p>
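        <p>The label coding described above can be decoded with a few lines of code. The helper below is a hypothetical convenience that we add for illustration; it only assumes the five-character scheme (four characters for the journal, one for the section type) and uses journal codes that appear in this paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SECTION_NAMES = {
    "i": "Introduction",
    "m": "Methods",
    "r": "Results",
    "d": "Discussion",
}


def parse_label(label):
    """Split a five-character label such as 'pmedm' into (journal, section)."""
    journal, section = label[:4], label[4:]
    if len(label) != 5 or section not in SECTION_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return journal, SECTION_NAMES[section]


parsed = [parse_label(x) for x in ("ppati", "pmedm", "pntdr", "pcbid")]
```
]]></preformat>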
        <p>The hierarchical clustering allows us to illustrate the hierarchical
organisation of the groups, as shown in figure 4. This visualization confirms the previous
result, but also offers a hierarchical view of the clusters. The hierarchical analysis
highlights the specific character of the Results sections in pmed and pntd, which are
in the same group as the Introduction and Discussion sections of these journals.
Again, ppat appears in a separate cluster. The journal pcbi (PLOS
Computational Biology) is also in a separate sub-cluster, which suggests that the citation
contexts in this journal are quite different from those in the other journals.</p>
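        <p>The way such a hierarchy is built can be sketched with a small agglomerative procedure: every item starts in its own cluster and the two closest clusters are merged repeatedly. The linkage criterion used for figure 4 is not stated here, so the average linkage and Euclidean distance below are assumptions, as are the toy points and the function names.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    total = sum(
        squared_distance(points[i], points[j]) ** 0.5
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points):
    """Return the merge history as (members_a, members_b, distance) triples."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Evaluate every pair of current clusters and keep the closest one.
        candidates = [
            (average_linkage(a, b, points), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        ]
        distance, ia, ib = min(candidates)
        merges.append((clusters[ia], clusters[ib], distance))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (ia, ib)]
        clusters.append(merged)
    return merges


# Four toy points; indices 0 and 1 are close, as are indices 2 and 3.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
merges = agglomerate(points)
```
]]></preformat>
        <p>The merge history is what a dendrogram displays: here points 0 and 1 are joined first, then points 2 and 3, and the two resulting groups are joined last.</p>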
      </sec>
      <sec id="sec-4-2">
        <title>Discussion and Conclusion</title>
        <p>
          This work emphasizes the need for a tool to identify and analyze the contexts of
citations, while being aware of the multidimensional nature of these phenomena.
Indeed, we have not yet addressed the problem of the categorization of
citations. Our approach aims to determine the clusters of citation contexts, so that
at a later stage the topics can be identified and processed for a finer
analysis. In order to do this, we can consider for example the syntax of citation
contexts and the topics, as already proposed in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. One of the advantages of
using the topic modeling approach is the possibility to deal with large volumes
of textual data. For example, this type of approach has already been used in the
study of political texts [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Studying the structure of scientific papers and observing the regularities in
the contexts of in-text citations is an important step towards understanding
the phenomenon of citation, which is central to the process of building scientific
knowledge. Different types of citations exist based on the motivation to cite and
the relation between the citing authors and the cited work. To be able to create
an ontology of citations that reflects the types of citations found in articles, it
is necessary to process existing corpora and study the properties of citation
contexts on a large scale.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Acknowledgments</title>
        <p>We thank Benoit Macaluso of the Observatoire des Sciences et des Technologies
(OST), Montreal, Canada, for harvesting and providing the PLOS data set.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bertin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atanassova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A study of lexical distribution in citation contexts through the IMRaD standard</article-title>
          .
          <source>In: Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval co-located with 36th European Conference on Information Retrieval (ECIR 2014)</source>
          . pp.
          <fpage>5</fpage>
          –
          <lpage>12</lpage>
          . Amsterdam, The Netherlands (April 13,
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bertin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atanassova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Factorial correspondence analysis applied to citation contexts</article-title>
          .
          <source>In: Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval co-located with 37th European Conference on Information Retrieval (ECIR 2015)</source>
          . Vienna, Austria (March 29,
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bertin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atanassova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larivière</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gingras</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The invariant distribution of references in scientific papers</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>67</volume>
          (
          <issue>1</issue>
          ),
          <fpage>164</fpage>
          –
          <lpage>177</lpage>
          (
          <year>January 2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steyvers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Integrating topics and syntax</article-title>
          .
          <source>In: NIPS</source>
          . vol.
          <volume>4</volume>
          , pp.
          <fpage>537</fpage>
          –
          <lpage>544</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hartigan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Algorithm AS 136: A k-means clustering algorithm</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series C (Applied Statistics)</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ),
          <fpage>100</fpage>
          –
          <lpage>108</lpage>
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kintigh</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammerman</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Heuristic approaches to spatial analysis in archaeology</article-title>
          .
          <source>American Antiquity</source>
          pp.
          <fpage>31</fpage>
          –
          <lpage>63</lpage>
          (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stewart</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tingley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinclair</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blattman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corstange</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Humphreys</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jamal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milner</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitts</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spirling</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Computer-Assisted Text Analysis for Comparative Politics</article-title>
          .
          <source>Political Analysis</source>
          <volume>4</volume>
          (
          <issue>23</issue>
          ),
          <fpage>254</fpage>
          –
          <lpage>277</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stewart</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Airoldi</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>A model of text for experimentation in the social sciences</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>111</volume>
          (
          <issue>515</issue>
          ),
          <fpage>988</fpage>
          –
          <lpage>1003</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stewart</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tingley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>stm: R package for structural topic models</article-title>
          .
          <source>R package version 0.6 1</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>