<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OpenCSMap: A System for Geolocating Computer Science Publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felipe Manen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <email>ahogang@dcc.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DCC, University of Chile; IMFD;</institution>
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although academic search engines are important tools for researchers, they typically support limited (if any) geographical features. In this demo we present a system that allows researchers to search for a specific Computer Science research topic and visualize in a map which affiliations have publications matching their search. The dataset is based on DBLP, using Entity Linking (OpenTapioca) over author affiliations to find geographic metadata for publications from Wikidata. Demo link: http://opencsmap.dcc.uchile.cl/</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Academic search engines help researchers to find relevant literature for their
topics of interest and to understand the impact of publications, venues, and
authors. However, we are not aware of any available search engine that can return
information about how many publications on a given topic are associated with a
specific city or region. A system with such characteristics could help researchers
to find new collaborators that work close by on similar topics; to select a place
for organizing a conference or for continuing their career; to identify and
include researchers from under-represented regions when organizing conferences;
to understand networks of collaboration between cities and countries; etc.</p>
      <p>
        Addressing this gap, we propose a system called OpenCSMap, which allows
users to search over publications by keyword, and then aggregates all matching
publications geographically (based on affiliation) in a map visualization. The
current version of the system focuses on Computer Science publications, and
is based on a snapshot of DBLP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], extracting metadata for all the
conference and journal papers it describes. Geographic features are enabled by linking
the authors' affiliations to Wikidata [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] using the OpenTapioca Entity Linking
system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. SPARQL queries over the Wikidata Query Service are then used to
obtain geographical information (coordinates, city, country) for those affiliations.
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>OpenCSMap can be divided into two main parts, relating first to the dataset
preparation, and second to the search system itself. Semantic Web tools and
resources (Wikidata, OpenTapioca) are used for the dataset preparation, while
a NoSQL back-end (Elasticsearch) is used for the search system. Code for the
system is available from https://github.com/fmanen/OpenCSMap.</p>
      <p>Dataset: We use DBLP as the core of our dataset. For OpenCSMap, we consider
journal articles and conference (including workshop) papers, totaling 4,976,740
publications. The publication metadata we extract are the title, authors, year of
publication, the name of the journal or proceedings in which it is published, and
the DOI. Affiliations are not associated with publications, but rather with authors.
Hence it is unclear which affiliation an author had when publishing a particular
work. For this reason, and to reduce noise, we design a conservative solution to
select the most common affiliation across all authors, assigning one affiliation to
nearly 50% of the articles. While more detailed affiliation information is available
in commercial search engines such as Google Scholar, these datasets are not open.</p>
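      <p>The conservative selection of a single affiliation per publication could be sketched as follows. This is a minimal illustration rather than the OpenCSMap implementation: the input layout (one optional affiliation string per author of a paper), the function name, and the rule of assigning nothing when the most frequent affiliations are tied are all assumptions made for the example.</p>
      <preformat>
```python
from collections import Counter
from typing import Optional, Sequence


def select_affiliation(author_affiliations: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most common known affiliation among a paper's authors.

    Assumption: None is returned when no author has a known affiliation,
    or when the two most frequent affiliations are tied, so that ambiguous
    papers stay unassigned.
    """
    known = [aff for aff in author_affiliations if aff]
    if not known:
        return None
    ranked = Counter(known).most_common(2)
    # A tie for first place is treated as ambiguous (hypothetical rule).
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```
      </preformat>
      <p>Leaving tied or unknown cases unassigned trades coverage for precision, which is in the spirit of a conservative assignment that covers only part of the publications.</p>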
      <p>
        We then focus on obtaining the georeferenced information of DBLP's textual
affiliations. An overview of the process is shown in Figure 1. We performed
initial experiments using a number of Entity Linking (EL) tools with online
APIs – specifically DBpedia Spotlight [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], TAGME [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and OpenTapioca [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] – for
linking 100 randomly sampled affiliations to entities in DBpedia and Wikidata.
Based on manual evaluation, OpenTapioca provided the best precision (0.84),
followed by TAGME (0.81) and DBpedia Spotlight (0.48). We thus opted to
proceed with OpenTapioca. For each Wikidata ID, we pose SPARQL queries
using Wikidata's Query Service to filter the entities and extract the city and
country of the affiliation (if available) and its respective coordinates. We provide
the mapping of DBLP publications to Wikidata identifiers for their affiliations
online at https://zenodo.org/record/5038583.
      </p>
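      <p>A query of the kind described above might look as follows. This is a sketch under assumptions: the exact queries used by OpenCSMap are not given here, so the choice of Wikidata properties (P625 for coordinate location, P131 for the containing administrative entity, P17 for country), the variable names, and the helper functions are illustrative only. The code only builds the query and the request URL; it does not contact the service.</p>
      <preformat>
```python
import re
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed properties: P625 = coordinate location, P131 = located in the
# administrative territorial entity, P17 = country. P131 does not always
# point at a city, so further filtering may be needed in practice.
QUERY_TEMPLATE = """
SELECT ?coord ?placeLabel ?countryLabel WHERE {
  VALUES ?org { wd:__QID__ }
  OPTIONAL { ?org wdt:P625 ?coord . }
  OPTIONAL { ?org wdt:P131 ?place . }
  OPTIONAL { ?org wdt:P17 ?country . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def build_query(qid: str) -> str:
    """Return a SPARQL query for one Wikidata item ID such as 'Q42'."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return QUERY_TEMPLATE.replace("__QID__", qid).strip()


def build_request_url(qid: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    params = {"query": build_query(qid), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```
      </preformat>
      <p>Wrapping each property in OPTIONAL keeps an affiliation in the result even when only some of its geographic data are available, matching the "if available" caveat above.</p>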
      <p>Search system: The search system uses Elasticsearch for storing and querying
data; inverted indexes are built over the publication and geographical metadata.
The web application is built using Django, Bootstrap, and Leaflet (for maps).</p>
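      <p>A keyword search over such an index could be expressed in the Elasticsearch query DSL roughly as follows. This is a sketch only: the field names title, venue and affiliation_id, the function name, and the bucket limit are hypothetical and do not come from the OpenCSMap code. Only the multi_match query and the terms aggregation are standard Elasticsearch constructs.</p>
      <preformat>
```python
import json
from typing import Any, Dict


def build_search_body(keywords: str, max_affiliations: int = 1000) -> Dict[str, Any]:
    """Build a request body that matches keywords on paper and venue titles.

    All field names are hypothetical placeholders for the indexed metadata.
    """
    return {
        # No individual hits are requested here, only per-affiliation counts.
        "size": 0,
        "query": {
            "multi_match": {
                "query": keywords,
                "fields": ["title", "venue"],
            }
        },
        "aggs": {
            "by_affiliation": {
                "terms": {"field": "affiliation_id", "size": max_affiliations}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("semantic web"), indent=2))
```
      </preformat>
      <p>Counting matches per affiliation on the server side keeps the response small, since the map needs one marker per affiliation rather than one per publication.</p>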
    </sec>
    <sec id="sec-2">
      <title>Demo</title>
      <p>On the landing page, the user is presented with a keyword search dialog that
can be used to search for a topic of interest (for example, "semantic web"). The
search is matched on paper and/or conference/journal titles. When the search
button is clicked a map visualization is rendered as shown in Figure 2. The map
shows all of the affiliations that have at least one publication in the results. If
there are multiple affiliations nearby in the map, a cluster will be displayed with
the number of affiliations in the cluster. As the user zooms in on a particular
region, the clusters will become more fine-grained until markers are associated
with only one affiliation. If such a marker is clicked, the number of matching
publications and a list of publications are shown for that affiliation.</p>
      <p>Aside from direct keyword searches, the system also supports an advanced
search feature for restricting and aggregating results in a custom way. The
interface for advanced search is shown in Figure 3. It enables users to search by
topic, by author, by publication type, and by year of publication. It also allows
for aggregating publications at the level of affiliation, city or country.</p>
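      <p>The combination of restriction and aggregation offered by the advanced search can be illustrated with a small in-memory sketch. The record layout, the function name, and the year-range parameters are assumptions made for the example; the deployed system performs this work in its search back-end rather than in Python lists.</p>
      <preformat>
```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Publication:
    """Hypothetical flattened record for one matched publication."""
    title: str
    year: int
    affiliation: str
    city: str
    country: str


LEVELS = ("affiliation", "city", "country")


def aggregate(pubs: Iterable[Publication], level: str,
              year_from: Optional[int] = None,
              year_to: Optional[int] = None) -> Counter:
    """Count publications per affiliation, city or country in a year range."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    counts: Counter = Counter()
    for pub in pubs:
        # Skip publications outside the optional, inclusive year range.
        if year_from is not None and year_from > pub.year:
            continue
        if year_to is not None and pub.year > year_to:
            continue
        counts[getattr(pub, level)] += 1
    return counts
```
      </preformat>
      <p>Because the aggregation level is just the name of a record field, the same function serves all three levels without separate code paths.</p>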
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We performed some initial user experiments to understand the relative strengths
and weaknesses of the system. First we measured the response times for search;
for these purposes we extracted 5,000 randomly sampled keywords for Computer
Science papers from the Open Academic Graph
(https://www.microsoft.com/en-us/research/project/open-academic-graph/). We first performed a search for
affiliations with publications matching the keywords against the back-end, where
the average time for search per keyword was 18.3 ± 11.9 ms (mean ± standard deviation),
with each query returning on average 472.3 ± 383.3 matching affiliations. We also
tested the front-end times for requesting the map visualization over HTTP, where
each search took on average 147.2 ± 93.1 ms. Though the front-end and HTTP
connections add some overhead, response times remain well below a second.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Statement</th>
              <th />
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>I understand the purpose of the system</td>
              <td>1</td>
            </tr>
            <tr>
              <td>The system is intuitive to use</td>
              <td>1</td>
            </tr>
            <tr>
              <td>The system provides novel features vs. other online systems</td>
              <td>1</td>
            </tr>
            <tr>
              <td>I would use this system or a system like this</td>
              <td>2</td>
            </tr>
            <tr>
              <td>The data presented are complete and precise</td>
              <td>2</td>
            </tr>
            <tr>
              <td>The response times are reasonable</td>
              <td>1</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
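      <p>Response-time figures of the form "mean ± standard deviation", as reported in this section, can be obtained with a harness along the following lines. It is a sketch under assumptions: the search callable stands in for a real back-end or HTTP request, the function names are hypothetical, and the use of the sample (rather than population) standard deviation is a choice made here for illustration.</p>
      <preformat>
```python
import statistics
import time
from typing import Callable, Iterable, List, Tuple


def summarize(values: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation); needs two or more values."""
    return statistics.mean(values), statistics.stdev(values)


def time_queries(search: Callable[[str], object],
                 keywords: Iterable[str]) -> Tuple[float, float]:
    """Run one search per keyword and summarize the latencies in ms."""
    latencies_ms = []
    for keyword in keywords:
        start = time.perf_counter()
        search(keyword)  # placeholder for a back-end or HTTP request
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(latencies_ms)
```
      </preformat>
      <p>Passing the search function as a parameter lets the same harness time the back-end directly and the front-end over HTTP, so that the two sets of figures are comparable.</p>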
      <p>We also wished to evaluate the initial impressions of users of our system.
We prepared a short questionnaire of six statements on a 5-point Likert scale.
We sent the questionnaire to 20 full-time professors of the Computer Science
Department (DCC) of the University of Chile. The respondents were asked to
try some searches of their choosing and to explore the system before responding.
No detailed guidance on how to use the system, or its purpose, was provided. We
received 10 responses. The statements and results are shown in Table 1. Overall,
the responses lean positive. The most positive aspect related to the
low response times of the system, while more mixed or neutral responses were
seen in terms of whether or not the respondents would use such a system, and
regarding the completeness and precision of the data they observed.</p>
    </sec>
    <sec id="sec-4">
      <title>Limitations and future work</title>
      <p>OpenCSMap is a work in progress, and has some limitations that we aim to
address in future work. The main limitation of the current system relates to
the sparsity of affiliation data in DBLP, where we only find affiliations for 50%
of the publications. For future work we intend to assign an affiliation per
authorship of a publication, and to take into account temporal aspects relating
to affiliations changing. Not all affiliations can be geolocated through Wikidata.
Though this will become less of an issue as Wikidata expands and improves,
generic affiliations such as IBM may be too vague to be geolocated accurately.</p>
      <p>
        Another limitation is the recall of the searches. Our approach currently
matches the search keyword(s) against some of the fields of a publication
(title, proceedings or journal). Although this tends to provide relevant results, it
may miss results; for example, a paper mentioning "RDF" but not "semantic
web" would not be included in results for the latter. We are thus currently
investigating adding topic classifications to the dataset (based on CSO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
      <p>Though informative, our evaluation of the usability and usefulness of the
system – whose results are summarized in Table 1 – remains preliminary, where
our survey features some high-level questions and a relatively low number of
responses (10) from users who are all professors. Performing more detailed
evaluation with a more diverse set of users would help us to gain further insights on
how to improve the system, and to tailor it for specific preferences and use-cases.</p>
      <p>Other interesting ideas for future work include additional filtering options
to select publications of interest; integrating metrics for papers, conferences,
journals, etc., in order to support filtering by the impact of the publication or
the prestige of the venue; inclusion of additional details for the papers indexed,
such as abstracts; better navigation of the hierarchical information (e.g., from
region to country to city to institution to publication, and back); autocompletion
of specific topics of interest when searching by keyword; and a temporal slider
to dynamically restrict the publications displayed on the map.</p>
      <p>Acknowledgements: We would like to thank Henry Rosales and Daniel Diomedi
for their help with the Entity Linking process. We also thank the reviewers for
their suggestions on potential future improvements for the system. This work was
supported by ANID – Millennium Science Initiative Program – Code ICN17_002
and by FONDECYT Grant No. 1181896.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Delpeuch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>OpenTapioca: Lightweight Entity Linking for Wikidata</article-title>
          . In: Wikidata Workshop. CEUR-WS.org (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scaiella</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities)</article-title>
          .
          <source>In: CIKM</source>
          . pp.
          <volume>1625</volume>
          –
          <fpage>1628</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>DBLP - some lessons learned</article-title>
          .
          <source>PVLDB</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <volume>1493</volume>
          –
          <fpage>1500</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia Spotlight: shedding light on the web of documents</article-title>
          .
          <source>In: I-SEMANTICS</source>
          . pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Salatino</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thanapalasingam</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannocci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas</article-title>
          .
          <source>Data Intell</source>
          .
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <volume>379</volume>
          –
          <fpage>416</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <volume>78</volume>
          –
          <fpage>85</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>