<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>BIR</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Intrication between Information Retrieval and Bibliometrics: the case of scientific domain delineation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michel Zitt</string-name>
          <email>mzitt@numericable.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lereco Lab (retired), National Institute for Agronomic Research (INRA)</institution>
          ,
          <addr-line>Dept SAE2, Nantes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>14</volume>
      <abstract>
        <p>This invited paper summarizes a recent synthesis [8] on the topic of scientific domain delineation. See also [5].</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific Domain Delineation</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Bibliometrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Over the decades, bibliometrics and information retrieval (IR) have developed an
increasingly close relationship. Information retrieval (namely its application to science
and technology) first emphasized data organization, indexing and query systems.</p>
      <p>Bibliometrics, driven by exploration of science networks (Price) or evaluation of
research systems, was boosted by Garfield's "citation index", a co-word which
symbolizes the match between IR and bibliometrics. Kessler's work was a beacon
of their opening to mapping techniques.</p>
      <p>Bibliometric investigation, whether for cognitive or for evaluative purposes,
typically targets particular "domains" defined at some scale. The meso-level
scope covers sub-disciplines, fields and large research areas; the frontier with the
micro-level of research fronts or topics is quite fuzzy. Field delineation is
usually understood in terms of publication sets. Delineation requirements may be
quite basic or more demanding, depending on the study's ambitions and the
type of domain. High-level requirements are associated with large projects on
"difficult" fields such as emerging, transversal or complex areas. Operationalization
of delineation typically exploits three models, on their own or in combination.</p>
      <p>All of them rely on quantitative methods in statistics-probability, data analysis
(clustering and factor techniques) and graph-network theory.</p>
      <p>Model A covers institutional classifications and nomenclatures that stem
from the cooperation between scientists, S&amp;T policy experts, librarians and
information scientists. Examples: aggregate classifications from a few national and
international bodies; national or international patent office classifications;
classifications from producers of disciplinary or general databases, with various
degrees of disaggregation. A distinction should be made between institutional,
national and international schemes of classification. Schemes that are linked to
national or institutional evaluation, especially, are politically loaded, and even
"international" standards may be biased towards the points of view of
stakeholders. In this respect, the way in which disciplines are grouped or broken down
bears witness to social and national stakes at a given period. For a given
bibliometric project, predefined categories sometimes happen to meet the requirements,
although this is not generally the case for transversal or emerging areas.</p>
      <p>Model B, in contrast with this sedimented knowledge, relies on ad-hoc IR
investigations in a particular query system. In some cases, this appears as a
refinement brought to existing groupings. For example, whereas some databases
offer categories based on journal lists, variants may be established using some
expertise; the same goes for patent categories. It is quite common to delineate
relatively simple domains in this way, keeping in mind the limitations of
Bradfordian ranked lists. A better compromise requires lexical query (or citation)
facilities. A domain of science or technology is seldom captured by global terms,
so that it becomes necessary to combine restricted queries, using the various
adaptive techniques of up-to-date IR, in order to reach reasonable precision and
recall.</p>
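      <p>As an illustration of how the precision and recall of combined restricted queries might be monitored, here is a minimal Python sketch. The query names, the document identifiers and the expert-validated reference set are all hypothetical; in practice such a reference set is usually only a sample, so the recall obtained is an estimate.</p>
      <preformat preformat-type="code">
```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved.intersection(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers returned by three restricted queries.
query_results = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d3", "d4"},
    "q3": {"d5", "d9"},
}
# Hypothetical expert-validated set of documents belonging to the domain.
expert_set = {"d1", "d2", "d3", "d4", "d6"}

# Combining the restricted queries amounts to taking the union of their results.
combined = set().union(*query_results.values())

per_query = {name: precision_recall(docs, expert_set)
             for name, docs in query_results.items()}
overall = precision_recall(combined, expert_set)

for name, (p, r) in per_query.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"combined: precision={overall[0]:.2f} recall={overall[1]:.2f}")
```
      </preformat>
      <p>Per-query figures help to spot noisy queries (here the hypothetical q3), which may then be restricted or dropped before the next iteration.</p>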
      <p>Model C is based on bibliometric mapping and clustering, which makes
"invisible colleges" visible through a variety of bibliometric networks (actors,
texts, citations, etc.). A domain is then expected to emerge as a delimited area
on a map or graph at the proper scale. In practice, the mental expectations of
users seldom match a single deus ex machina mapping exercise, which inevitably
conveys implicit points of view and technical artefacts. Variable adaptive
combinations of top-down phases (cutting the domain out of an extended map of
science) and/or bottom-up ones (aggregation of themes in low-level maps) will
help. Mapping techniques exploit developments in clustering and community
detection (graph unfolding: Louvain, Infomap, SLMA...; spectral methods: LSA,
pLSA, LDA...) as well as classical clustering and factor analysis.</p>
      <p>The modes of supervision (by commissioners, bibliometricians, scientists
from the domain...) are crucial to validate the findings at various stages. Model A
mostly uses "frozen" IR knowledge, which does not spare discussion of its
pertinence in the particular case. Model B queries require experts' mental
representation of the field, if only to avoid missing significant subareas. Model C has to
deal with peculiarities of the mapping methods. Border areas, particularly, need
supervision.</p>
      <p>Multi-networks. Scientific IR as well as bibliometrics bring into play the
various networks that are explicitly or implicitly created by published S&amp;T:
actors, papers, texts, citations, classification categories, sometimes funding
information, etc. Citation certainly remains the iconic network of bibliometrics, while
the IR tradition mostly exploited nomenclature-based indexes and lexical terms. The
Internet era shuffled the cards, with the rise of webometrics. Quasi-citations
(URL linkages) migrated from journal bibliometrics to Google; PageRank then
irrigated webometrics and bibliometrics. The cognitive theory of scientific
information [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] relies on a multinetwork universe. The associated poly-representation
of literature offers to combine or confront approaches using different networks
(a spontaneous practice in pragmatic bibliometrics) in order to address a
wide class of issues on a particular dataset.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>BIR 2020, 14 April 2020, Lisbon, Portugal.</p>
      <p>Thus, many producers of scientometric indicators rely on the default SCI<sup>1</sup> or
Scopus classes, or ad-hoc lists of journals, for a rough delimitation. Pragmatic
mixes improve the process: lexical or nomenclature index queries; supervised lists
of prominent authors, or most cited ones, in multistage and adaptive protocols.</p>
      <p>For example, a high-precision detection of a core literature may be completed
by various recall-oriented extensions (query expansion, network proximity, etc.)
with the possible help of science maps. Protocols involving multi-network design
are fruitful. Two networks are generally held as the most precise for thematic
analysis: lexical terms (words/phrases from controlled or natural language) and
paper-level citations; the author networks are also valuable. Citations and words
exhibit analogies and also differences: granularity level, statistical properties,
unification issues especially in natural language, dynamic capability (more
direct for citation). Citation biases are severe in an evaluation context but
somewhat less so in a mapping context. Lexical analysis is prone to natural language
traps: homonymy, synonymy, metaphors, etc. There is a huge literature on the
properties of the two universes.</p>
      <p>
        Sequential word-citation protocols give a few examples of how theme
delimitation and mapping take advantage of those properties. Among those mentioned
in the aforementioned synthesis, the sequence that starts with careful core
building, using supervised lexical queries, and continues with a quasi-automatic
citation-based extension, proved efficient ([
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and related works).
      </p>
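      <p>The idea of a lexical core followed by a citation-based extension can be sketched as follows. This is a simplified illustration, not the protocol of the cited works: the function name, the two thresholds (min_share, min_links) and the toy data are assumptions made for the example.</p>
      <preformat preformat-type="code">
```python
def extend_core(core, references, min_share=0.5, min_links=2):
    """Add non-core papers whose reference lists point strongly to the
    literature cited by the core.

    core:       set of paper ids selected by supervised lexical queries
    references: dict mapping each paper id to the set of reference ids it cites
    min_share, min_links: illustrative thresholds (assumptions of this sketch)
    """
    # References cited by at least one core paper.
    core_refs = set()
    for paper in core:
        core_refs.update(references.get(paper, set()))

    extension = set()
    for paper, refs in references.items():
        if paper in core or not refs:
            continue
        links = len(refs.intersection(core_refs))
        # Keep papers with enough links, both in number and as a share of
        # their own reference list.
        if links >= min_links and links / len(refs) >= min_share:
            extension.add(paper)
    return extension


# Hypothetical toy data: p1 and p2 form the lexically defined core.
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r4"},
    "p3": {"r1", "r2", "r4", "r8"},    # 3 of 4 references shared with the core
    "p4": {"r1", "r9", "r10", "r11"},  # only 1 shared reference
    "p5": {"r20", "r21"},              # no shared reference
}
core = {"p1", "p2"}
extension = extend_core(core, references)
print(sorted(extension))
```
      </preformat>
      <p>Raising the thresholds favours precision, lowering them favours recall; in a real study their setting would be part of the supervision.</p>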
      <p>
        Alternatively, one may first compare techniques, by exploring the
citation way and the lexical way in parallel, and then compare and possibly combine
the results. We explored this at a cluster delineation level, namely the detection
of areas within a given domain (delimited beforehand by hybrid techniques:
nanoscience, genomics...). We established that word and citation techniques
shape fairly convergent breakdowns but without coincidence, a phenomenon well
characterized in cross-maps [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Hence, the two approaches are akin but not
substitutable, as each one carries its own infometric and sociological point of
view.
      </p>
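      <p>A very reduced view of such a comparison is the cross-tabulation of two partitions of the same documents, one from word-based clustering and one from citation-based clustering. The sketch below is only an illustration under assumed inputs: the cluster labels are hypothetical, and the "purity" score is a generic agreement measure chosen for the example, not the cross-map construction of the cited work.</p>
      <preformat preformat-type="code">
```python
from collections import Counter


def cross_table(word_labels, citation_labels):
    """Count documents in each (word cluster, citation cluster) pair."""
    shared = set(word_labels).intersection(citation_labels)
    return Counter((word_labels[d], citation_labels[d]) for d in shared)


def purity(table):
    """Share of documents lying in the dominant citation cluster of their
    word cluster (1.0 would mean the breakdowns coincide up to merging)."""
    best = {}
    for (word_cluster, _citation_cluster), n in table.items():
        best[word_cluster] = max(best.get(word_cluster, 0), n)
    total = sum(table.values())
    return sum(best.values()) / total if total else 0.0


# Hypothetical cluster assignments for six documents.
word_labels = {"d1": "W1", "d2": "W1", "d3": "W1",
               "d4": "W2", "d5": "W2", "d6": "W2"}
citation_labels = {"d1": "C1", "d2": "C1", "d3": "C2",
                   "d4": "C2", "d5": "C2", "d6": "C3"}

table = cross_table(word_labels, citation_labels)
score = purity(table)
for pair, n in sorted(table.items()):
    print(pair, n)
print(f"purity = {score:.2f}")
```
      </preformat>
      <p>A score clearly above chance but below 1 corresponds to the situation described above: convergent breakdowns without coincidence.</p>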
      <p>
        The "full hybrid" approaches, where citation and text tokens are mixed, differ
from the mildly hybrid ones. Even in naive form, hybrid bibliographic coupling
outperformed simple bibliographic coupling in Boyack and Klavans studies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Flexible integrated
approaches account for differences in statistical distributions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Full
hybridization privileges the information science perspective and deliberately overlooks
the sociological dimension of the two networks.
      </p>
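      <p>To make the contrast concrete, the sketch below computes two document similarities: a naive full hybrid, where reference and term tokens are pooled into one vector, and a weighted mix of separate citation and text similarities. Both are illustrative assumptions (cosine on raw counts, a linear weight) rather than the formulas of the cited studies, and the two toy documents are hypothetical.</p>
      <preformat preformat-type="code">
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pooled_similarity(doc_a, doc_b):
    """Naive full hybrid: references and terms pooled in a single token bag."""
    def bag(doc):
        return Counter([f"ref:{r}" for r in doc["refs"]] +
                       [f"term:{t}" for t in doc["terms"]])
    return cosine(bag(doc_a), bag(doc_b))


def weighted_similarity(doc_a, doc_b, weight=0.5):
    """Linear mix of bibliographic-coupling cosine and term cosine.
    The weight is an assumption of this sketch."""
    cit = cosine(Counter(doc_a["refs"]), Counter(doc_b["refs"]))
    txt = cosine(Counter(doc_a["terms"]), Counter(doc_b["terms"]))
    return weight * cit + (1 - weight) * txt


# Hypothetical documents: identical reference lists, partly different terms.
doc_a = {"refs": ["r1", "r2"], "terms": ["nanotube", "carbon"]}
doc_b = {"refs": ["r1", "r2"], "terms": ["graphene", "carbon"]}

naive = pooled_similarity(doc_a, doc_b)
mixed = weighted_similarity(doc_a, doc_b, weight=0.5)
print(f"pooled = {naive:.2f}, weighted = {mixed:.2f}")
```
      </preformat>
      <p>On real data the two token types have very different frequency distributions, so pooling them without normalization lets one network dominate; this is the issue that the more flexible integrated approaches try to address.</p>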
      <p>Most classical techniques rely on citation keys and, in the lexical realm,
on word n-grams or more sophisticated linguistic analysis. Instead, featureless
representations of information, such as character n-grams or bit sequences, proved
efficient in our experience. Related n-gram or compression distances are used.</p>
      <p><sup>1</sup> SCI, WoS, WoK: products from ISI, Thomson, Thomson-Reuters, now Clarivate.</p>
      <p>The cost is the black-box effect, losing any direct semantic interpretation. This
can be operated field by field (e.g. actors / all text / references) or for the whole
document. In the latter case, the representation is savagely hybrid.</p>
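      <p>Two featureless distances of the kind mentioned above can be sketched with the Python standard library: a Jaccard distance on character n-grams and a normalized compression distance. The choice of trigrams, of zlib as the compressor and of the sample records is an assumption made for the illustration; it is not claimed to be the setting used in the works summarized here.</p>
      <preformat preformat-type="code">
```python
import zlib


def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_distance(x, y, n=3):
    """Jaccard distance between the character n-gram sets of two strings."""
    gx, gy = char_ngrams(x, n), char_ngrams(y, n)
    union = gx.union(gy)
    return 1.0 - len(gx.intersection(gy)) / len(union) if union else 0.0


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical records: the whole record (title and references) is treated
# as one raw string, without any tokenization.
rec_a = "Carbon nanotube growth by chemical vapour deposition | Iijima 1991; Dai 1996"
rec_b = "Growth of carbon nanotubes on catalysts | Iijima 1991; Dai 1996"
rec_c = "Gene expression profiling in yeast | DeRisi 1997; Spellman 1998"

print(ngram_distance(rec_a, rec_b), ngram_distance(rec_a, rec_c))
print(ncd(rec_a, rec_b), ncd(rec_a, rec_c))
```
      </preformat>
      <p>Neither distance knows what a word or a reference is, which illustrates the black-box effect: proximity can be measured, but not directly interpreted. Compression-based values on very short strings are noisy, so longer records (or whole documents) are preferable.</p>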
      <p>
        Delineation ultimately aims at defining a domain in terms of documents,
but this objective may be met through indirect paths (rather than inter-article
coupling) and probabilistic models. Citation-in-context studies, following Small's
investigations ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and many others) changed the granularity of hybrid
thinking in bibliometrics. The visualization of citation contexts in online engines is
now classical. Narrowing the context of particular citations also has various
applications in bibliometric studies: section affiliation of references (introduction,
methods, findings...); a step further, association or mutual labelling of words
and references within linguistic units (sentences, paragraphs...). From this fine
granularity, hybridization techniques expect a strong gain in precision, whatever
the aim: historiography, classification, etc. Prior decomposition of an article into
smaller lexical units might enhance the convergence of cocitation and coword
clustering in the aforementioned cross-mapping exercise.
      </p>
      <p>As previously stated, the frontier with micro-level small topic delineation
is fuzzy. Meso-scale bibliometric studies, which involve a breakdown into many
topics, cannot generally afford detailed supervision at this micro-level, where
themes are detected by some automatic data analysis. However, using
cross-maps or hybrid designs may help to look for robust "strong forms", or at
least cores, in multinetworks, which resist changes of settings and vantage points.</p>
      <p>
        Supervision is nevertheless necessary at the key stages, and costly. For the
validity, acceptability and involvement of actors alike, the careful dimensioning and
conduct of supervision is a key factor in large studies.
      </p>
      <p>
        An extensive bibliography is found in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Below are just a few indications on our
works, and a few classical milestones.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Boyack</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klavans</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately?</article-title>
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>61</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2389</fpage>
          –
          <lpage>2404</lpage>
          (
          <year>2010</year>
          ), doi:10.1002/asi.21419
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ingwersen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cognitive perspectives of information retrieval interaction: Elements of a cognitive IR theory</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>52</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>50</lpage>
          (
          <year>1996</year>
          ), doi:10.1108/eb026960
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Janssens</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Glanzel, W.,
          <string-name>
            <surname>De Moor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis</article-title>
          . In: Berkhin,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          , Ga ney, S. (eds.)
          <source>KDD'07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <volume>360</volume>
          {
          <fpage>369</fpage>
          .
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery (
          <year>2007</year>
          ), doi:10.1145/1281192.1281233
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Small</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Co-citation context analysis and the structure of paradigms</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <fpage>183</fpage>
          –
          <lpage>196</lpage>
          (
          <year>1980</year>
          ), doi:10.1108/eb026695
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Zitt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation</article-title>
          .
          <source>Scientometrics</source>
          <volume>102</volume>
          (
          <issue>3</issue>
          ),
          <fpage>2223</fpage>
          –
          <lpage>2245</lpage>
          (
          <year>2015</year>
          ), doi:10.1007/s11192-014-1482-5
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zitt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bassecoulard</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>42</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1513</fpage>
          –
          <lpage>1531</lpage>
          (
          <year>2006</year>
          ), doi:10.1016/j.ipm.2006.03.016
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Zitt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lelu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bassecoulard</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Hybrid citation-word representations in science mapping: Portolan charts of research fields?</article-title>
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>62</volume>
          (
          <issue>1</issue>
          ),
          <fpage>19</fpage>
          –
          <lpage>39</lpage>
          (
          <year>2011</year>
          ), doi:10.1002/asi.21440
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Zitt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lelu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cabanac</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Bibliometric delineation of scientific fields</article-title>
          . In:
          <string-name>
            <surname>Glänzel</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moed</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmoch</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>Springer Handbook of Science and Technology Indicators</source>
          , chap. 2, pp.
          <fpage>25</fpage>
          –
          <lpage>68</lpage>
          . Springer, Berlin (
          <year>2019</year>
          ), doi:10.1007/978-3-030-02511-3_2
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>