<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s11192-006-0115-z</article-id>
      <title-group>
        <article-title>Preliminary Results of a Scientometric Analysis of the German Information Retrieval Community 2020-2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Schaer</string-name>
          <email>philipp.schaer@th-koeln.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Svetlana Myshkina</string-name>
          <email>svetlana.myshkina@smail.th-koeln.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jüri Keller</string-name>
          <email>jueri.keller@th-koeln.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TH Köln (University of Applied Sciences)</institution>
          ,
          <addr-line>Gustav-Heinemann-Ufer 54, 50968 Köln</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>41</volume>
      <fpage>547</fpage>
      <lpage>554</lpage>
      <abstract>
        <p>The German Information Retrieval (IR) community is located in two different sub-fields: information science and computer science. There are no current studies that investigate these communities on a scientometric level. Available studies only focus on the information science part of the community. We generated a data set of 401 recent IR-related publications extracted from six core IR conferences with a mainly computer science background. We analyze this data set at the institutional and researcher level. The data set is publicly released, and we also demonstrate a mapping use case.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>scientometric analysis</kwd>
        <kwd>data set</kwd>
        <kwd>co-authorship</kwd>
        <kwd>research institutes</kwd>
        <kwd>networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>2000. More recent studies are scarce and mostly look at the information science side of the
community or do not look at the German community specifically.</p>
      <p>Therefore, in this work, we investigate some recent characteristics of the German IR
community by building a data set of recent publications from six core IR conferences. We
analyze this data set at the institutional and the individual researcher levels to learn more about
central actors in the field and to present first, preliminary results of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        An example of a German-focused study is the work of Baumgartner and Schlögl [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who
analyzed the proceedings of the International Symposium on Information Science (ISI) between
1990 and 2004. They discussed the international claim of the conference with respect to academic
coverage and the quality of the publications but did not focus on IR specifically. The analysis
showed that the articles were written by 1.6 authors on average. This corresponds to the usual
publication behavior in the information sciences. Most of the articles (81%) were written in
German, but 57% of all cited references were in English. The evaluation of the ISI conference
proceedings revealed a highly skewed distribution of authorship: the share of authors with
more than three publications was only 4.8%, while a few authors were constantly represented
and formed a core community. The most productive research groups came from Konstanz, Graz,
Regensburg, Hildesheim, and Saarbrücken.
      </p>
      <p>
        In 2015, Lewandowski and Haustein [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] analyzed the citation behavior of German information
scientists and studied the cited literature within the handbook “Grundlagen der praktischen
Information und Dokumentation”, which has a specific set of chapters related to IR topics. They
found patterns similar to those reported by Baumgartner and Schlögl: an average of 1.4 authors per
paper, citations mostly to journal articles, and writing mostly in German. Considering the findings of
Larivière et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that authors, due to their specialization in specific research areas, can
be representative of topics (concepts), they evaluated the results of their cluster analysis. They
deduced that German-speaking information scientists probably work on somewhat distant topics.
Consequently, the German-language information science community does not seem cohesive.
      </p>
      <p>
        More explicitly tailored to the IR community, the bibliometric studies of Ding et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
Thornley et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and more recently Larsen [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have to be mentioned. Ding et al. used a
co-word analysis to map the academic community in the field of information retrieval. While
this work is highly recognized in the field with more than 1000 citations, it does not give any
insights into current trends, groups, or individuals, as it is based on an analysis of
publications from before the year 2000. The analyses of Thornley et al. and Larsen are more
recent but do not focus on Germany and each look at a single conference only (TRECVid and CLEF,
respectively).
      </p>
      <p>When conducting scientometric studies, one has to keep in mind that these kinds of studies are
often limited. Selection bias and inclusion/exclusion criteria are the usual suspects when it comes to
the validity of these kinds of studies [7]. Such biases are introduced not only at the researcher level,
where the researcher chooses which publications to include in the analysis, but also at the data provider level [8].
Database curators can influence the coverage of the databases that scientometric researchers work
with and therefore influence the results.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Set Generation</title>
      <p>We compiled a data set of research publications by analyzing the proceedings of six major
IR-related and peer-reviewed conferences (CHIIR, CIKM, CLEF, ECIR, SIGIR, WWW) available
in the ACM Digital Library. The publication dates ranged between January 2020 and June
2023. We only considered those publications that had at least one German author or an author
who was affiliated with a German research institute. Other central publication venues of the
community, like the Springer Information Retrieval Journal or ACM TOIS, were not considered,
as we wanted to focus on conferences. The TREC and CLEF workshop notes were also not
included because they lack peer review.</p>
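      <p>For the six conferences mentioned above, we gathered the following publication metadata:</p>
      <list list-type="bullet">
        <list-item><p>author names,</p></list-item>
        <list-item><p>affiliation,</p></list-item>
        <list-item><p>titles,</p></list-item>
        <list-item><p>DOI of the publication.</p></list-item>
      </list>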
      <p>As affiliation names were given in many different variants, we harmonized them manually.
We added detailed information about the working group as well as the complete postal address
and geo-location. In the case of different departments or research groups within an institute,
we kept all sub-groups separate as long as the sub-group names were explicitly mentioned in
the publications. In total, we discovered 401 publications and 195 different affiliations. The data
set is publicly available in a GitHub repository<sup>3</sup>.</p>
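      <p>The harmonization described above was done manually. As a rough illustration only, a lookup table of known spelling variants could support such a step and make the count of distinct affiliations reproducible. The variant table, the function names, and the sample strings below are assumptions made for this sketch and are not taken from the released data set.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical variant table: keys are normalized spellings, values are the
# canonical group names. The entries are illustrative, not from the data set.
CANONICAL = {
    "th koeln": "TH Köln",
    "th köln": "TH Köln",
    "technische hochschule köln": "TH Köln",
    "webis": "Webis Group",
    "webis group": "Webis Group",
}


def harmonize(raw_name):
    """Map a raw affiliation string to its canonical name, if one is known."""
    key = " ".join(raw_name.lower().split())  # lowercase, collapse whitespace
    return CANONICAL.get(key, raw_name.strip())


def count_affiliations(raw_names):
    """Count how often each harmonized affiliation occurs."""
    return Counter(harmonize(name) for name in raw_names)


raw = ["TH Köln", "th  koeln", "Webis", "Webis Group", "Unknown Lab "]
counts = count_affiliations(raw)
print(len(counts), "distinct affiliations:", dict(counts))
```
      </preformat>
      <p>Unknown spellings are passed through unchanged, so they remain visible for manual review instead of being merged silently.</p>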
    </sec>
    <sec id="sec-4">
      <title>4. Most Productive Research Groups</title>
      <p>From the 195 research groups we found in the data set, we extracted the ten most productive
ones (see Table 1). The most productive group is the Webis Group from Weimar, Leipzig, Jena,
and formerly Halle, followed closely by the Databases and Information Systems group from the
Max Planck Institute and the L3S research center. Webis and L3S are “virtual groups” whose
members also have co-affiliations with universities or non-university research centers (like
TIB Hannover). We decided not to split up these publications, as the publishing collaboration
within these institutes is intense, and authors publicly affiliate with them. Given the size of
these groups, it is not surprising to see them at the top of the list. Had we split up
these groups, the ordering would not have changed much, as the two central Webis locations,
Weimar and Leipzig, would still have made it to the top.</p>
      <p>The list also includes one commercial research institute (Bosch Center for Artificial
Intelligence), one university of applied sciences (TH Köln), and two non-university research centers
(GESIS and TIB), showing the interdisciplinary and heterogeneous constitution of the German
IR community.</p>
      <p>In addition to the total number of publications within the time frame, we report the count for each
of the six core conferences separately. We see a mixed set of publication profiles: the University of
Regensburg only published at CHIIR and ECIR, while Max Planck, Webis, and L3S had a more
heterogeneous coverage.</p>
      <p><sup>3</sup> https://github.com/irgroup/LWDA2023-IR-community</p>
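      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>The ten most productive research groups.</p></caption>
        <table>
          <thead>
            <tr><th>Research group</th></tr>
          </thead>
          <tbody>
            <tr><td>Webis Group</td></tr>
            <tr><td>Max Planck Institute - Databases and Inf. Systems</td></tr>
            <tr><td>Forschungszentrum L3S</td></tr>
            <tr><td>GESIS - Leibniz-Institut für Sozialwissenschaften</td></tr>
            <tr><td>TIB - Forschungsgruppe Visual Analytics</td></tr>
            <tr><td>U Bonn - Data Science &amp; Intelligent Systems</td></tr>
            <tr><td>U Regensburg - Chair of Information Science</td></tr>
            <tr><td>U Mannheim - Data and Web Science Group</td></tr>
            <tr><td>Bosch Center for Artificial Intelligence</td></tr>
            <tr><td>TH Köln - Information Retrieval Research Group</td></tr>
          </tbody>
        </table>
      </table-wrap>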
    </sec>
    <sec id="sec-5">
      <title>5. Co-Authorship in the German IR Community</title>
      <p>In contrast to the previously mentioned observations from the field of information science, we
wanted to check the characteristics of co-authorship and collaboration in the more computer
science-related IR conferences. In Table 2, the number of authors ranges from 1 to 17 for single
publications. On average, the top ten research groups published papers with 4.83 authors,
while across the whole data set the average number of authors was 4.98. With eight authors, the
Webis Group has the highest average number of authors per paper, while the second most
productive group (Max Planck) has the lowest (three on average).
The number of authors alone can therefore not explain the publication success of a group.
We should nevertheless keep in mind that the number of co-authors per paper can introduce
distortions in the calculation of additional network-based performance metrics [9].</p>
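      <p>The averages discussed above can be computed along the following lines. This is a minimal sketch: the record layout (an author list and a list of groups per publication) and the toy records are assumptions made for illustration, and a publication is counted once for every group it is affiliated with.</p>
      <preformat>
```python
from collections import defaultdict
from statistics import mean

# Toy records with made-up authors and groups; not taken from the data set.
publications = [
    {"authors": ["A", "B", "C"], "groups": ["Group X"]},
    {"authors": ["A", "D"], "groups": ["Group X", "Group Y"]},
    {"authors": ["E", "F", "G", "H", "I"], "groups": ["Group Y"]},
]


def mean_authors(pubs):
    """Average number of authors per publication over all records."""
    return mean(len(pub["authors"]) for pub in pubs)


def mean_authors_per_group(pubs):
    """Average number of authors per publication, separately for each group."""
    sizes = defaultdict(list)
    for pub in pubs:
        for group in set(pub["groups"]):
            sizes[group].append(len(pub["authors"]))
    return {group: mean(values) for group, values in sizes.items()}


overall = mean_authors(publications)
per_group = mean_authors_per_group(publications)
print(round(overall, 2), per_group)
```
      </preformat>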
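      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption><p>The thirty best-connected authors from German research institutes and their affiliations.</p></caption>
        <table>
          <thead>
            <tr><th>Author</th><th>Affiliation</th></tr>
          </thead>
          <tbody>
            <tr><td>Lucie Flek</td><td>U Bonn / U Marburg</td></tr>
            <tr><td>Martin Potthast</td><td>Webis Group</td></tr>
            <tr><td>Ralph Ewerth</td><td>TIB Hannover</td></tr>
            <tr><td>Benno Stein</td><td>Webis Group</td></tr>
            <tr><td>Jens Lehmann</td><td>Amazon</td></tr>
            <tr><td>Stefan Dietze</td><td>GESIS, Köln</td></tr>
            <tr><td>Gerhard Weikum</td><td>Max Planck Institute</td></tr>
            <tr><td>Avishek Anand</td><td>L3S</td></tr>
            <tr><td>Rishiraj Saha Roy</td><td>Max Planck Institute</td></tr>
            <tr><td>Daniel Hienert</td><td>GESIS, Köln</td></tr>
            <tr><td>Kuldeep Singh</td><td>Cerence</td></tr>
            <tr><td>Megha Khosla</td><td>L3S</td></tr>
            <tr><td>Norbert Fuhr</td><td>U Duisburg-Essen</td></tr>
            <tr><td>Andrew Yates</td><td>Max Planck Institute</td></tr>
            <tr><td>Axel-Cyrille Ngonga Ngomo</td><td>U Paderborn</td></tr>
            <tr><td>Matthias Hagen</td><td>Webis Group</td></tr>
            <tr><td>Ran Yu</td><td>GESIS, Köln</td></tr>
            <tr><td>Chris Biemann</td><td>U Hamburg</td></tr>
            <tr><td>Sherzod Hakimov</td><td>U Potsdam</td></tr>
            <tr><td>Maria-Esther Vidal</td><td>L3S</td></tr>
            <tr><td>Endri Kacupaj</td><td>Cerence</td></tr>
            <tr><td>Henning Wachsmuth</td><td>Webis Group</td></tr>
            <tr><td>David Elsweiler</td><td>U Regensburg</td></tr>
            <tr><td>Andreas Both</td><td>HTWK Leipzig</td></tr>
            <tr><td>Janek Bevendorf</td><td>Webis Group</td></tr>
            <tr><td>Maria Maleshkova</td><td>U der Bundeswehr, Hamburg</td></tr>
            <tr><td>Philipp Schaer</td><td>TH Köln</td></tr>
            <tr><td>Timo Breuer</td><td>TH Köln</td></tr>
            <tr><td>Alexander Bondarenko</td><td>Webis Group</td></tr>
            <tr><td>Yvonne Kammerer</td><td>HDM Stuttgart</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Given these limitations, we analyzed the publications with the help of a co-authorship network.
All author collaborations form a network of 1159 nodes (authors) and 4907 edges (co-authorship
relations). To allow a more nuanced impression in comparison to simple publication counts, we
calculated betweenness centrality on the network. Betweenness centrality tells us how important
a node is to form the network by connecting different parts. In a co-authorship network, these
are authors that bridge different communities of working groups and are important for the
connectivity of the network.</p>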
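      <p>To make the notion of a bridging author concrete, the following sketch builds a small co-authorship graph and computes unnormalized betweenness centrality with Brandes' algorithm for unweighted graphs. The paper does not state which tool or normalization was used, so the algorithm choice, the function names, and the toy author lists are assumptions made for illustration.</p>
      <preformat>
```python
from collections import defaultdict, deque
from itertools import combinations


def build_coauthor_graph(papers):
    """Undirected graph: one node per author, one edge per co-author pair."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author]  # single-author papers still create a node
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search from source, counting shortest paths (sigma).
        order = []
        preds = {node: [] for node in graph}
        sigma = dict.fromkeys(graph, 0)
        dist = dict.fromkeys(graph, -1)
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] == -1:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, starting with the farthest nodes.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Toy author lists: two teams that are connected only through author "X".
papers = [["A", "B", "C"], ["D", "E", "F"], ["C", "X"], ["X", "D"]]
scores = betweenness(build_coauthor_graph(papers))
for author, value in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(author, value)
```
      </preformat>
      <p>In this toy network, author X has only two papers but lies on every shortest path between the two teams, which mirrors how an author with few publications can still rank highly by betweenness.</p>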
      <p>In Table 3, we see the thirty best-connected authors of the field. We only selected those
researchers who work at German research institutes. However, we can see in the co-authorship
network that many other well-connected researchers are from outside of Germany. The
top-ranked researcher is Lucie Flek (U Bonn and U Marburg), with only three publications in total,
which might be surprising, but these papers were published at WWW, SIGIR, and ECIR and had
no overlap in co-authors. Therefore, she is a well-connected author in the co-authorship network.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Topics of Publications per Research Group</title>
      <p>We applied simple topic modeling to the top ten research institutes based on the titles of the
publications published by each institute. The titles are combined into one document per institute, and the terms
in these documents (up to bi-grams) are weighted by TF-IDF after basic text processing. The
terms with the highest TF-IDF per institute give insights into the topics the institute mainly
focuses on and which differentiate it from the other groups. Table 4 gives an overview of
the top 3 terms per research institute. While most terms, like knowledge, question answering,
or search, are related to IR, some terms appear unexpected. The highest-ranked term for the
Bosch Center for Artificial Intelligence is, for example, welding. This term is rarely used in
the context of IR; in two publications, the institute describes how welding can be monitored
with the help of IR techniques [10, 11].</p>
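      <p>The term weighting described above can be sketched as follows. The paper does not specify the TF-IDF variant or the text processing, so this example assumes relative term frequency, an unsmoothed logarithmic inverse document frequency, lowercasing, and no stop-word removal. The group names and titles are invented for illustration.</p>
      <preformat>
```python
import math
import re
from collections import Counter


def terms(title):
    """Lowercased unigrams and bigrams of one title (assumed preprocessing)."""
    tokens = re.findall(r"[a-z]+", title.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def top_terms(titles_by_group, k=3):
    """Return the k terms with the highest TF-IDF weight for each group."""
    # One document per group: the terms of all of its titles combined.
    docs = {
        group: Counter(term for title in titles for term in terms(title))
        for group, titles in titles_by_group.items()
    }
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    result = {}
    for group, counts in docs.items():
        total = sum(counts.values())
        weights = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending weight; ties are broken alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result[group] = ranked[:k]
    return result


# Invented titles for two hypothetical groups.
titles_by_group = {
    "Group X": ["Question answering over knowledge graphs",
                "Conversational question answering"],
    "Group Y": ["Monitoring welding quality", "Welding process search"],
}
result = top_terms(titles_by_group)
print(result)
```
      </preformat>
      <p>Terms that occur in the titles of every group receive a weight of zero under this variant, which is why group-specific terms such as welding surface at the top.</p>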
      <p>Likewise, the title terms of the publications with German authors were analyzed by conference.
By comparing the top terms of a group with the top terms of the conference at which the group mainly
publishes, some correlations could be found. For example, for the Databases and Information Systems
group of the Max Planck Institute, question answering has a high TF-IDF; this term also ranks
high for the SIGIR conference, at which the group mainly publishes. Other indirect similarities
can be observed, for example, between the L3S with machine learning-related terms and
CIKM. Since the initial data set for the topics of the conferences and groups is the same, and
productive groups therefore significantly influence the top terms per conference, a correlation
is not surprising.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Visualizing IR Research Groups</title>
      <p>We use the available data to draw the geo-locations of each research group on a map using
the OpenStreetMap platform. The map is available online<sup>4</sup>. The map includes all groups and
institutes that published at least two times in the six previously mentioned conferences. To
extend the map with additional institutes that might otherwise be missed, we added groups that were
active in academic societies, like the German Special Interest Group on Information Retrieval,
between 2020 and 2023. Each data point on the map includes the name of the group, the postal
address, and a link to the institute’s homepage. Figure 1 shows a sample of the map.</p>
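      <p>One possible way to prepare such map data is to export the groups as GeoJSON, a format that many OpenStreetMap-based tools can import. The paper does not describe the export format, so GeoJSON, the record fields, and the placeholder groups below are assumptions made for this sketch. The threshold of two publications follows the inclusion rule stated above.</p>
      <preformat>
```python
import json

# Made-up records; names, addresses, URLs, and coordinates are placeholders.
groups = [
    {"name": "Example Group A", "address": "Example Street 1, 12345 Example City",
     "homepage": "https://example.org/a", "lat": 50.9, "lon": 6.9, "publications": 5},
    {"name": "Example Group B", "address": "Sample Road 2, 54321 Sample Town",
     "homepage": "https://example.org/b", "lat": 52.4, "lon": 9.7, "publications": 1},
]


def to_geojson(records, min_publications=2):
    """Build a GeoJSON FeatureCollection for groups above the threshold."""
    features = []
    for rec in records:
        if rec["publications"] >= min_publications:
            features.append({
                "type": "Feature",
                # GeoJSON stores positions as [longitude, latitude].
                "geometry": {"type": "Point",
                             "coordinates": [rec["lon"], rec["lat"]]},
                "properties": {"name": rec["name"], "address": rec["address"],
                               "homepage": rec["homepage"]},
            })
    return {"type": "FeatureCollection", "features": features}


collection = to_geojson(groups)
print(json.dumps(collection, indent=2, ensure_ascii=False))
```
      </preformat>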
    </sec>
    <sec id="sec-8">
      <title>8. Discussion and Future Work</title>
      <p>We conducted a small-scale scientometric study of the German IR community using publications
from 2020 until mid-2023. For a more in-depth investigation, we would need a more extended
time coverage and should investigate more conferences to reflect the field’s heterogeneity.
Additionally, the chosen time frame largely coincides with the COVID-19 pandemic, which might
have introduced some uncommon publication patterns (like submitting to conferences without
traveling to these conferences). The six conferences are relevant to the field, but other related
conferences like JCDL or ICTIR might also be included in a later version of the data set.
Including CIKM might have introduced some topical shift in the data set, as the main focus of
CIKM is not information retrieval (although some relevant IR papers are published there). A more
fine-tuned topical selection process might increase the quality of the data set. Additionally, the
decision to discard TREC and CLEF lab notebooks also excluded some potentially interesting
publications and, consequently, may be the reason for missing research groups.</p>
      <p>While all these limitations are known and are valid complaints regarding the preliminary
results of this scientometric study, it is the only available data set for the German IR community.
The data set in its current form therefore only allows us to get some first, preliminary insights
into the community and its actors at the institutional and researcher levels. Nevertheless, it
gives us an idea of the rich collaborations happening in the field.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Foo</surname>
          </string-name>
          ,
          <article-title>Bibliometric cartography of information retrieval research by using co-word analysis</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>37</volume>
          (
          <year>2001</year>
          )
          <fpage>817</fpage>
          -
          <lpage>842</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0306457300000510. doi:10.1016/S0306-4573(00)00051-0.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schlögl</surname>
          </string-name>
          ,
          <article-title>Die Tagungsbände des Internationalen Symposiums für Informationswissenschaft in szientometrischer Analyse</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Osswald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stempfhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wolf</surname>
          </string-name>
          (Eds.),
          <source>"Open Innovation" - Neue Perspektiven im Kontext von Information und Wissen: 10. Internationalen Symposiums für Informationswissenschaft, ISI 2007, Köln, Germany, 30. Mai - 1. Juni 2007</source>
          , volume
          <volume>46</volume>
          of Schriften zur Informationswissenschaft, UVK,
          <year>2007</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>59</lpage>
          . URL: https://doi.org/10.5281/zenodo.4134714. doi:10.5281/zenodo.4134714.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lewandowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haustein</surname>
          </string-name>
          ,
          <article-title>What does the German-language information science community cite? - An analysis of the German information science handbook "Grundlagen der praktischen Information und Dokumentation"</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Pehar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schlögl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wolf</surname>
          </string-name>
          (Eds.),
          <source>Re:inventing Information Science in the Networked Society. Proceedings of the 14th International Symposium on Information Science, ISI 2015, Zadar, Croatia, May 19-21, 2015</source>
          , volume
          <volume>66</volume>
          of Schriften zur Informationswissenschaft, Verlag Werner Hülsbusch,
          <year>2015</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          . URL: https://doi.org/10.5281/zenodo.17973. doi:10.5281/zenodo.17973.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Larivière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Sugimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cronin</surname>
          </string-name>
          ,
          <article-title>A bibliometric chronicling of library and information science's first hundred years</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>63</volume>
          (
          <year>2012</year>
          )
          <fpage>997</fpage>
          -
          <lpage>1016</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.22645. doi:10.1002/asi.22645.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Thornley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>McLoughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>A bibliometric study of Video Retrieval Evaluation Benchmarking (TRECVid): A methodological analysis</article-title>
          ,
          <source>Journal of Information Science</source>
          <volume>37</volume>
          (
          <year>2011</year>
          )
          <fpage>577</fpage>
          -
          <lpage>593</lpage>
          . URL: http://journals.sagepub.com/doi/10.1177/0165551511420032. doi:10.1177/0165551511420032.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <article-title>The Scholarly Impact of CLEF 2010-2017: A Google Scholar Analysis of CLEF Proceedings and Working Notes</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          (Eds.), Information Retrieval
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>