<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COVIDCube: An RDF Data Cube for Exploring Among-Country COVID-19 Correlations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tamara Novoa-Rodríguez</string-name>
          <email>tamara.novoa@ug.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aidan Hogan</string-name>
          <email>ahogan@dcc.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIE/DCC, University of Chile; IMFD</institution>
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an RDF Data Cube, integrated from numerous sources on the Web, that describes countries in terms of general variables (e.g., GDP, population density) and COVID-19 variables. On top of this data cube, we develop a system that computes and visualises correlations between these variables, providing insights into the factors that correlate with COVID-19 cases, deaths, etc., on an international level. Demo link: https://c19.dcc.uchile.cl/</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The among-country variation in the numbers of reported COVID-19 cases
and deaths per capita is not well understood [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Various hypotheses have been
proposed, such as high prevalence of comorbidities, climate, pollution, health
services, public policies, etc. While a number of initiatives have explored this
variance (e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), or have developed relevant datasets to better understand this
variance (e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), there remains uncertainty regarding the factors involved.
      </p>
      <p>In this demo, we discuss ongoing work on the preparation of
an RDF Data Cube that collects a wide variety of both general and
COVID-19-specific variables at a country level. On top of this RDF Data Cube,
we have built a system to visually explore the correlations that exist between
such variables in order to gain insights into the among-country variations observed
for COVID-19. We call this data cube and associated system "COVIDCube".
Code is made available online at https://github.com/tmnvrd/COVIDCube.</p>
    </sec>
    <sec id="sec-2">
      <title>COVIDCube</title>
      <p>We first describe the data preparation for generating the RDF Data Cube.
Thereafter we describe the system used to visualise correlations.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>RDF Data Cube: A wide range of datasets available online provide
different types of indicators for countries. In order to narrow down the scope of
the variables considered, we conducted an initial survey in search of hypotheses
relating to the among-country variations observed for reported COVID-19 cases
and deaths. From these, we broadly identified three main categories of indicators
relating to economics (e.g., GDP, wages, unemployment), health (e.g., obesity
and other comorbidities, blood type, vitamin-C deficiency, adult and child
mortality rates), and climate (e.g., temperature, precipitation, pollution). We also
identified hypotheses relating to miscellaneous factors, such as political
ideologies, transportation networks, tourism, etc. Based on these factors, we began
to identify potential sources of data online at the international level, extracting
525 variables from sources including Our World in Data, Wikipedia, the World
Health Organization and the World Bank, among others. These datasets, mostly
tabular in nature, were extracted as CSV files. We further extracted data for 4
variables pertaining to COVID-19 at the international level from the Johns Hopkins
University Center for Systems Science and Engineering (CSSE) and Our World
in Data (OWID), namely: confirmed cases (CSSE), confirmed deaths (CSSE),
confirmed recovered (CSSE), and stringency index (OWID).<sup>1</sup></p>
      <p>
        The data were diverse: some datasets were broken down by temporal or
regional dimensions; measures were provided in different units; naming variations
were present for countries or regions; etc. We wished to integrate the data while
modelling their provenance. We thus chose to adopt the RDF Data Cube
vocabulary and model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], using handcrafted Tarql mappings<sup>2</sup> to convert from the raw
CSV for each variable to the desired RDF data. Each variable was treated as
a distinct dataset and we manually added metadata (using PROV-O and other
standard vocabularies) to allow for tracking provenance. Entities mentioned in
the datasets (relating to countries, regions, types of disease, genders, education
level, etc.) were mapped to their corresponding Wikidata identifiers, thus
resolving variations in naming across datasets. Given the time-consuming nature
of this task, rather than mapping all 529 variables to a dataset, we identified and
converted 79 variables specifically relating to hypotheses found during our
survey (including the 4 COVID-19 variables), along with an additional 39 datasets
for other variables of interest, resulting in an RDF Data Cube containing 118
different datasets/variables integrated from eight distinct sources, with a total
of 442,420 individual observations covering 251 countries (or territories). The
data are hosted in a Fuseki SPARQL endpoint.<sup>3</sup>
        <fn><label>1</label><p>The stringency index is a composite measure used to indicate the strictness of public
policies to curtail COVID-19 transmission, including travel bans, school closures, etc.</p></fn>
        <fn><label>2</label><p>https://tarql.github.io/</p></fn>
        <fn><label>3</label><p>See https://c19.dcc.uchile.cl/db/dataset.html?tab=query&amp;ds=/ds.</p></fn>
      </p>
      <p>Visualisation: In order to visually explore among-country correlations between
COVID-19 variables and other variables, we built a prototype system in Flask
(a micro web framework written in Python) to query the Fuseki back-end for
the available pairs of variables, and compute correlations for them. In terms of
the correlation measures, we chose to use Pearson's r and Spearman's ρ, both of
which provide a value in the interval [−1, 1], with −1 indicating perfect negative
correlation, 0 no correlation, and 1 perfect positive correlation. We further
calculate p-values, which indicate the probability of observing a correlation at
least as strong as the one measured under the null hypothesis, namely that there
is no relation between the variables.
We noticed that a confounding factor for many of the variables presented
related to population: we thus normalised selected variables by population in the
query prior to calculating the correlations. The results are then visualised in
four heat-map matrices, collecting together variables categorised by economics,
health, climate and miscellaneous. Each matrix has rows denoting general
variables and columns denoting the four COVID-19 variables, with the rows ordered
from highest correlation to lowest correlation with respect to the total number of
COVID-19 cases. To improve response times, data are cached in the front-end.
      </p>
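      <p>The two correlation measures described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used by COVIDCube: the function names are hypothetical, p-values are omitted, and Spearman's ρ is computed as Pearson's r over average ranks. The per_capita helper mirrors the idea of normalising a variable by population before correlating.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as Pearson's r over average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def per_capita(values, populations):
    """Normalise each value by the population of the same country."""
    return [v / p for v, p in zip(values, populations)]


# Hypothetical toy data (not taken from the data cube): a monotone but
# non-linear relation gives rho = 1 while r stays below 1.
indicator = [1.0, 2.0, 3.0, 4.0]
cases = per_capita([100.0, 800.0, 2700.0, 6400.0], [100.0, 200.0, 300.0, 400.0])
print(pearson_r(indicator, cases), spearman_rho(indicator, cases))
```
]]></preformat>
      <p>Working on ranks makes Spearman's ρ sensitive to any monotone relation, whereas Pearson's r measures linear association only, which is one reason to report both.</p>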
      <p>We provide a screenshot of part of the visualisation in Figure 1, showing the
top-10 health variables in terms of positive correlation with confirmed COVID-19
cases. The colours and values indicate the value of the correlation, where the colours
range from red (positive) to blue (negative). Results that reach a particular level
of statistical significance (p &lt; α) are marked according to the threshold reached
(α = 0.05 or α = 0.01).
We see that variables such as life expectancy, body mass index, prevalence of
overweight children, obesity, blood types A and O, etc., correlate positively with
the number of confirmed COVID-19 cases. The reader may notice that
<italic>Prevalence of overweight children</italic> appears twice in the results. This is because
y-axis labels are sometimes hierarchical, and only the first level of the label is
shown to avoid overly long variable names; dimension 5 refers to male children,
while dimension 6 refers to female children, which can be seen by hovering over
the respective result in the interface. Further such results for other variables can
be explored in our online demo: https://c19.dcc.uchile.cl/.</p>
      <p>Evaluation: To gain an initial understanding of users' opinions of the system, we
created a quantitative survey that was published in a university forum and received
52 responses. On a Likert scale of 1 to 5, users were most positive with respect to
the functionality (average: 4.25) and usefulness (average: 4.12) of the platform,
but were less positive regarding how easy it was to understand the information
provided (average: 2.98), since users require some base statistical knowledge of
correlations, p-values, etc., in order to fully understand the data presented.</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>Correlation obviously does not imply causation, but correlations may lead to
novel hypotheses and prompt further study regarding potential underlying
factors for among-country variations. Though an in-depth analysis of the
correlations found is out of scope, in summary, we did find a number of variables
spanning the different categories that had statistically significant positive or negative
correlation with the COVID-19 variables: more than would be expected
according to the chosen significance level (e.g., we find more than the expected 1/20
variables satisfying the α = 0.05 threshold). Examples of significant correlations
can be seen in Figure 1, and also by visiting our demo.</p>
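      <p>To make the 1/20 baseline concrete, the following sketch computes how many variables would be expected to pass a threshold α by chance alone, and the probability of seeing at least a given number of such results. It rests on assumptions that are ours rather than the paper's: the count of 114 general variables is simply the 118 datasets minus the 4 COVID-19 variables, the function names are hypothetical, and the tests are treated as independent, which is unlikely to hold for indicators that are themselves correlated.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def expected_false_positives(num_tests, alpha):
    """Expected number of tests passing threshold alpha if every null holds."""
    return num_tests * alpha


def prob_at_least(k, num_tests, alpha):
    """P(X >= k) for X ~ Binomial(num_tests, alpha), assuming independent tests."""
    return sum(
        math.comb(num_tests, i) * alpha**i * (1 - alpha) ** (num_tests - i)
        for i in range(k, num_tests + 1)
    )


# Assumed count: 118 datasets minus the 4 COVID-19 variables.
general_variables = 118 - 4
print(expected_false_positives(general_variables, 0.05))  # about 5.7
print(prob_at_least(15, general_variables, 0.05))
```
]]></preformat>
      <p>Under these assumptions, roughly six of the general variables would be expected to reach the α = 0.05 threshold against any one COVID-19 variable purely by chance, so only a clearly larger count suggests genuine associations.</p>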
      <p>
        Some correlations observed were to be expected, such as the positive
correlations for obesity, which is associated with comorbidities for COVID-19.
However, other correlations were surprising. For example, one of the most
negatively correlated health-related variables was <italic>No access to handwashing facility</italic>,
which suggests, with statistically significant results, that countries where fewer
people can wash their hands tend to have fewer per-capita cases of COVID-19.
This seems counter-intuitive as handwashing is considered a way to reduce
transmission of the virus. Another (initially) counter-intuitive result is that increased
health expenditure per capita correlates positively with COVID-19 cases, where
one would expect that better health services would help to reduce transmission
and cases. However, by considering the overall matrices, a common
confounding factor begins to emerge: namely the level of development of the country,
with more developed countries tending to have more confirmed cases. Similar
observations can be found elsewhere, and possible explanations include a lack of
testing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and an increased remoteness [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of developing countries.
      </p>
      <p>While it is not possible to draw firm conclusions about the among-country
variance observed for COVID-19 from the currently available data, COVIDCube
does provide some insights and clues as to potentially important factors. Deriving
more definitive conclusions will require integrating data from further
diverse sources, for which RDF Data Cubes provide a relevant solution.</p>
      <p>Regarding the use of RDF Data Cubes, in the original plan for the project,
we had intended to use a relational database as our back-end to load different
variables by country. However, when accessing the raw data, we
encountered a variety of issues relating to their diversity, including the use of different
units and multipliers; different names being used; different countries being
included/excluded from certain sources; the same demographics (such as gender,
age, etc.) appearing in different measures; some countries having their results
presented by state or region; different measures having different
temporal granularities; etc. Rather than cleaning and preprocessing the data to "fit"
a clean relational schema, we chose to use RDF Data Cubes to
represent the diverse underlying data in a more complete way, and thereafter use
SPARQL queries to compute the tables from which correlations were extracted.
This approach offered a greater decoupling between the data preparation and
the application design, where we could focus on representing the underlying data
as completely as possible in the RDF Data Cube, and then later use SPARQL
queries to extract the data needed for the application from multiple sources.</p>
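      <p>The step from SPARQL results to a table of paired values can be sketched as follows. This is a hedged illustration, not the queries used by COVIDCube: qb:Observation and qb:dataSet are terms of the RDF Data Cube vocabulary, but the ex:country and ex:value properties, the dataset IRI passed in, and the numeric values are hypothetical placeholders. The parsing assumes the standard SPARQL 1.1 JSON results layout, such as a Fuseki endpoint can return.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json

QB = "http://purl.org/linked-data/cube#"


def observation_query(dataset_iri):
    """Build a SELECT query for (country, value) pairs of one dataset.

    ex:country and ex:value are hypothetical property names; the real
    dimension and measure properties depend on the cube's structure.
    """
    return f"""PREFIX qb: <{QB}>
PREFIX ex: <http://example.org/hypothetical#>
SELECT ?country ?value WHERE {{
  ?obs a qb:Observation ;
       qb:dataSet <{dataset_iri}> ;
       ex:country ?country ;
       ex:value ?value .
}}"""


def values_by_country(results_json):
    """Map country IRI to numeric value from SPARQL JSON results text."""
    data = json.loads(results_json)
    return {
        row["country"]["value"]: float(row["value"]["value"])
        for row in data["results"]["bindings"]
    }


def paired_table(left, right):
    """Keep only countries present in both variables, sorted by IRI."""
    shared = sorted(left.keys() & right.keys())
    return [(country, left[country], right[country]) for country in shared]


def fake_results(pairs):
    """Serialise (country, value) pairs in the SPARQL JSON results layout."""
    bindings = [
        {
            "country": {"type": "uri", "value": country},
            "value": {"type": "literal", "value": str(value)},
        }
        for country, value in pairs
    ]
    return json.dumps(
        {"head": {"vars": ["country", "value"]}, "results": {"bindings": bindings}}
    )


# Made-up values keyed by Wikidata entity IRIs, for illustration only.
WD = "http://www.wikidata.org/entity/"
obesity = values_by_country(fake_results([(WD + "Q298", 28.0), (WD + "Q414", 28.3)]))
cases = values_by_country(fake_results([(WD + "Q298", 0.05), (WD + "Q155", 0.06)]))
print(paired_table(obesity, cases))
```
]]></preformat>
      <p>Restricting the table to countries present in both variables reflects the observation above that different sources include or exclude different countries; the resulting rows can be passed directly to a correlation routine.</p>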
      <p>
        In terms of future directions, we wish to investigate correlations along
temporal dimensions; such data are available in the data cube but are not exploited
by the visualisation. Similarly, although the underlying dataset tracks
provenance information, this is not shown in the current interface; adding links to
the underlying sources used would help others to build upon and reproduce the
results shown. Incorporating other data sources, including integration with
existing RDF datasets (e.g., on COVID-19 [
        <xref ref-type="bibr" rid="ref3 ref7">3,7</xref>
        ]), would also be of interest to
enrich the data and enhance the analyses possible. As COVID-19 remains an
ongoing phenomenon, it would also be of interest to implement a framework to
automatically update the statistics based on the underlying sources.
      </p>
      <p>Acknowledgements: We would like to thank Bryan Ortiz Pizarro, Catalina
Rojas Zuñiga, Cecilia Pilar Mancill, Clemente Parades Gomez, Cristobal Masías
Duran, Jose Miguel Pacheco, Loreto Palma Donoso, Osvaldo Garay Roos,
Sebastian Aguilera Valenzuela, Tomas Torres Bardavid and Valent an Espina
Carmona for their considerable help with mapping raw data from CSV to Turtle.
We also thank the anonymous reviewers for their very helpful feedback. This
work was supported by ANID (Millennium Science Initiative Program, Code
ICN17_002) and by FONDECYT Grant No. 1181896.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>The RDF Data Cube Vocabulary. W3C Recommendation</source>
          (
          <year>2014</year>
          ),
          <uri>https://www.w3.org/TR/vocab-data-cube/</uri>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gisselquist</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaccaro</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Why countries best placed to handle the pandemic appear to have fared the worst</article-title>
          .
          <source>The Conversation</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research</article-title>
          . In: ISWC. pp.
          <fpage>294</fpage>
          –
          <lpage>310</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Data regarding country-specific variability in Covid-19 prevalence, incidence, and case fatality rate</article-title>
          .
          <source>Data in Brief</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nordling</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Africa's pandemic puzzle: why so few cases and deaths?</article-title>
          <source>Science</source>
          <volume>369</volume>
          (
          <issue>6505</issue>
          ),
          <fpage>756</fpage>
          –
          <lpage>757</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sorci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faivre</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morand</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Explaining among-country variation in COVID-19 case fatality rate</article-title>
          .
          <source>Scientific Reports</source>
          <volume>10</volume>
          (
          <issue>18909</issue>
          ),
          <fpage>1493</fpage>
          –
          <lpage>1500</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Steenwinckel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>Facilitating the Analysis of COVID-19 Literature Through a Knowledge Graph</article-title>
          . In: ISWC (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>