Reflections on: Modeling Linked Open Statistical Data

Evangelos Kalampokis1,2, Dimitris Zeginis1,2, and Konstantinos Tarabanis1,2

1 University of Macedonia, Information Systems Lab, Egnatia 156, Thessaloniki 54006, Greece
ekal@uom.edu.gr, zeginis@uom.gr, kat@uom.edu.gr
2 Centre for Research & Technology Hellas, Information Technologies Institute, 6th km Xarilaou - Thermi, Thessaloniki 57001, Greece

Abstract. A major part of Open Data concerns statistics such as economic and social indicators. Statistical data are structured in a multidimensional manner, creating data cubes. Recently, National Statistical Institutes and public authorities adopted the Linked Data paradigm to publish their statistical data on the Web. Many vocabularies have been created to enable modeling data cubes as RDF graphs, and thus creating Linked Open Statistical Data (LOSD). However, the creation of LOSD remains a demanding task, mainly because of modeling challenges related either to the conceptual definition of the cube or to the way of modeling cubes as linked data. The aim of this paper is to identify and clarify (a) modeling challenges related to the creation of LOSD and (b) approaches to address them. Towards this end, LOSD experts were involved in an interactive feedback collection and consensus-building process based on the Delphi method. We anticipate that the results of this paper will contribute towards the formulation of best practices for creating LOSD, and thus facilitate combining and analysing statistical data from diverse sources on the Web.

Keywords: Linked Open Statistical Data · Modeling Challenges · Delphi Method

1 Introduction

International organizations, governments, and companies are increasingly opening up their data for others to reuse [1]. A major part of open data concerns statistics [2] such as demographic, economic, and social indicators.
Statistical data are organized in a multidimensional manner, and thus they can be conceptualized as data cubes. These data can be important primary material for added-value services and products, which can increase government transparency, contribute to economic growth, and provide social value to citizens [3]. Linked data has been introduced as a promising paradigm for opening up data because it facilitates data integration on the Web [4]. In statistics, linked data enable performing analytics on top of disparate and previously isolated datasets [5]. As a result, many National Statistical Institutes and public authorities have already used the linked data paradigm to publish statistical data on the Web.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Many vocabularies have been created to enable modeling data cubes as RDF graphs. However, the creation of Linked Open Statistical Data (LOSD) remains a demanding task, mainly because of modeling challenges related either to the conceptual definition of a cube or to the way of modeling cubes as linked data. The former regards challenges such as the number of measures or the number of units to include in a cube, while the latter relates to the lack of clarity on the way to apply the proposed vocabularies and to the lack of specialized standards. All these modeling challenges are currently addressed by data publishers in an ad hoc manner, and thus they hinder publishing LOSD in a uniform way that would facilitate their wide exploitation.

The aim of this paper is to identify modeling challenges related to the creation of LOSD and approaches to address them. Towards this end, nine LOSD experts were involved in an interactive feedback collection and consensus-building process. The experts indicated and evaluated modeling challenges and approaches to address them.
The goal is to build consensus on the approaches that can be adopted to address LOSD modeling challenges.

The rest of the paper is organized as follows: Section 2 presents the method that was followed, and Section 3 presents the state of the art analysis regarding LOSD standards. Section 4 briefly presents the results of the Delphi method. Finally, Section 5 discusses open challenges and Section 6 summarizes the results.

2 Method

The method employed for the involvement of the LOSD experts is Delphi [6], which facilitates consensus-building by using a questionnaire with multiple iterations to collect feedback until stability in the responses is attained. One of the characteristics of Delphi is that participants remain anonymous to each other. This prevents the domination of some participants (e.g., because of their reputation). Delphi can be iterated continuously until consensus is achieved. However, the literature has pointed out that two iterations are often enough to reach sufficient consensus [7]. The two rounds of the presented study are the following:

Round 1: Usually the first round uses an open-ended questionnaire. However, we adopted a common modification that uses a structured (aka closed) questionnaire based upon a preparatory phase. The preparatory phase contained: (i) a state of the art analysis on the data cube model to identify the main LOSD modeling constructs, (ii) the involvement of experts to identify LOSD modeling challenges, and (iii) an analysis of LOSD standards to identify approaches related to the modeling challenges. The structured questionnaire asked experts to review, select, or rank the initially identified approaches related to the modeling challenges. As a result, areas of disagreement/agreement were identified. The results included advantages/disadvantages of the publishing approaches as well as other publishing approaches not identified at the preparatory phase.
Round 2: The feedback collected in the first round was organized and a second questionnaire was created. This questionnaire was re-structured to be more comprehensive and incorporated the advantages/disadvantages identified in the first round to provide additional insights to the experts. It also contained the approaches of the first round on which consensus was achieved, so that experts could review them. In every question, the experts were asked to state the rationale behind their choice. The result of Round 2 included all the LOSD modeling challenges, an analysis of the approaches related to these challenges, and all the approaches where consensus was achieved.

The selection of appropriate experts is very important in a Delphi study, since it affects the quality of the produced results. Usually, around ten experts are sufficient. In our study, we included nine experts in the area of LOSD:

– An expert involved in the creation of the LOSD portals of the Scottish Government (http://statistics.gov.scot) and the UK Department for Communities and Local Government (http://opendatacommunities.org).
– An expert involved in the publishing of LOSD for the Flemish Government (https://id.milieuinfo.be).
– An expert involved in the creation of the LOSD portal for the European Commission's Digital Agenda (http://digital-agenda-data.eu/data).
– An expert involved in the creation of the portal of the Italian National Institute of Statistics (http://datiopen.istat.it).
– An expert involved in the creation of the QB vocabulary.
– An expert who created LOSD using data from international organizations such as Eurostat, OECD, IMF, and the World Bank.
– An expert working at the National Institute of Statistics and Economic Studies.
– An expert working in academia.
– An expert working in industry.

The study took place in 2017 and comprised two rounds that lasted two months each.
In order to facilitate the process we used Mesydel (https://mesydel.com), an online service that supports Delphi studies with multiple participating experts.

3 Preparatory phase: State of the art analysis

Statistical data usually concern aggregated data monitoring social and economic indicators [8]. They can be described in a multidimensional way, where a measure is described based on a number of dimensions. Thus, statistical data can be conceptualized as a data cube. The data cube model has already been defined in the literature [9, 10], and comprises a set of measures, which represent numerical values, and dimensions, which provide contextual information. Each dimension comprises a set of values (e.g., "Greece", "France") that can be hierarchically organized into levels (e.g., country, region). The location of each of the cube's cells is specified by the dimension values, while the value of a cell specifies the measure (e.g., the unemployment rate of "Greece" in "2016" is "23.1%").

A number of linked data standard vocabularies have been proposed to enable the publishing of data cubes. The QB vocabulary [11] is a W3C standard for publishing data cubes. The core class of the vocabulary is qb:DataSet, which represents a cube comprising a set of dimensions (qb:DimensionProperty), measures (qb:MeasureProperty), and attributes (qb:AttributeProperty). Each qb:DataSet has multiple qb:Observation instances that describe the cells of the cube.

In LOSD it is common practice to re-use predefined code lists to populate the dimension values. For example, the values of the time dimension can be obtained from the code list defined by reference.data.gov.uk, and the values of the unit of measure can be obtained from the QUDT units vocabulary (http://qudt.org/). However, predefined code lists do not always exist, so new ones should be specified using the QB vocabulary, the Simple Knowledge Organization System (SKOS) [12], or the Extended Knowledge Organization System (XKOS) [13].
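To illustrate these core QB constructs, the running example above (the unemployment rate of "Greece" in "2016" is "23.1%") can be sketched as a minimal cube in Turtle; all URIs in the ex: namespace are hypothetical, introduced here only for illustration:

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:            <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:             <http://example.org/def/> .

# Dimensions and measure of the cube
ex:refArea a qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dimension:refArea .
ex:refPeriod a qb:DimensionProperty ;
    rdfs:subPropertyOf sdmx-dimension:refPeriod .
ex:unemploymentRate a qb:MeasureProperty ;
    rdfs:label "Unemployment rate"@en .

# Cube structure (data structure definition)
ex:dsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:refArea ] ,
                 [ qb:dimension ex:refPeriod ] ,
                 [ qb:measure   ex:unemploymentRate ] .

# The cube and one of its cells
ex:unemployment a qb:DataSet ;
    qb:structure ex:dsd .

ex:obs1 a qb:Observation ;
    qb:dataSet ex:unemployment ;
    ex:refArea <http://example.org/code/GR> ;
    ex:refPeriod "2016"^^xsd:gYear ;
    ex:unemploymentRate 23.1 .
```

The dimension values locate the cell (Greece, 2016) and the measure property carries the cell's value, exactly as in the conceptual cube model described above.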
Finally, the UK Government Linked Data Working Group (https://github.com/UKGovLD/publishing-statistical-data) has developed a set of common dimensions (timePeriod, refArea, sex, age), measures (obsValue), and attributes (unitMeasure) that are intended to be reusable across data sets. The definition of these concepts is based on the SDMX guidelines.

All the above standard vocabularies facilitate the publishing of LOSD. However, in some cases there is a lack of clarity on how to apply these standards because they allow the adoption of different valid publishing approaches.

4 Delphi Results: Challenges and Approaches

This section briefly presents the results of the Delphi method. Tables 1 and 2 contain all the identified LOSD modeling challenges and the approaches where consensus was achieved among the experts. The following paragraphs elaborate on some challenges and approaches that need further clarification.

Regarding measure definition (Ch1), a commonly used property is sdmx-measure:obsValue. However, the experts indicated that it should not be used, because defining a measure as a sub-property of sdmx-measure:obsValue is redundant: it does not add any semantics beyond defining the measure as a qb:MeasureProperty.

Regarding the definition of the unit (Ch2.3), the QB vocabulary enables different levels, i.e., the qb:DataSet, the qb:MeasureProperty, and the qb:Observation. Defining the unit at the qb:DataSet or the qb:MeasureProperty level facilitates the retrieval of units directly from the structure of the cube, while defining it at the qb:MeasureProperty or the qb:Observation level enables the definition of multiple units in one cube. The qb:Observation level also enables observations to be re-used in another context, since they contain all relevant information. The experts proposed a hybrid approach: define the unit at the qb:Observation and, if needed, additionally at the qb:DataSet.
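This hybrid approach can be sketched in Turtle as follows; the ex: URIs are hypothetical and the QUDT unit URI is shown only as an illustrative example of a code-list value:

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix qudt-unit:      <http://qudt.org/vocab/unit/> .
@prefix ex:             <http://example.org/def/> .

# Unit attached to the observation, so each observation stays
# self-contained and re-usable in another context
ex:obs1 a qb:Observation ;
    qb:dataSet ex:unemployment ;
    sdmx-attribute:unitMeasure qudt-unit:PERCENT ;
    ex:unemploymentRate 23.1 .

# Unit additionally stated at the dataset level, so the available
# units can be retrieved from the cube structure without scanning
# all observations
ex:unemployment a qb:DataSet ;
    sdmx-attribute:unitMeasure qudt-unit:PERCENT .
```

The duplication costs some extra triples, but it combines the benefits of both attachment levels discussed above.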
The QB vocabulary proposes two practices for the definition of multiple measures per cube (Ch4): i) "multi-measure observations", which define multiple qb:MeasureProperty in the cube structure and use all measures in every observation, and ii) "measure dimension", which defines multiple qb:MeasureProperty at the structure but restricts observations to having a single measure. The first approach produces smaller cubes but cannot represent multiple units and measures, while the second approach enables the definition of multiple units and multiple measures. The experts proposed the use of the "measure dimension" approach.

Table 1. Challenges and approaches

Measure
Ch1: What property should be used to model a measure of a cube?
Ap1: A new measure property should be defined that is not a sub-property of sdmx-measure:obsValue. The new measure enables the annotation with additional properties (e.g., labels, comments).

Unit
Ch2.1: Should a cube include the unit of the measure?
Ap2.1: A unit of measure should always be included in the cube. The measure on its own is a plain numerical value and thus the unit is required to correctly interpret this value.
Ch2.2: What RDF property should be used to define the unit?
Ap2.2: sdmx-attribute:unitMeasure should always be re-used to define units. This property can be used directly to assign values that are not part of a code list (e.g., QUDT). However, when annotation with additional properties (e.g., labels, code list, etc.) is required, then new units that are sub-properties of sdmx-attribute:unitMeasure should be defined.
Ch2.3: Where should the unit be defined?
Ap2.3: The unit should be defined at the qb:Observation. The unit can be additionally defined at the qb:DataSet in order to facilitate the retrieval of the available units in a cube.
Ch2.4: What values should be used for the units?
Ap2.4: URIs from QUDT should be re-used. If QUDT is not sufficient, then DBpedia or other code lists can be used.

Multiple units
Ch3.1: Should one cube include multiple units for the same measure?
Ch3.2: Where to define multiple units?
Ap3: One cube with multiple units should be created and the unit should be defined at each qb:Observation. Conceptually, it is preferable to have all related units of the same measure in the same cube. The unit can be additionally defined at the qb:DataSet in order to facilitate the retrieval of the available units in a cube.

Multiple measures
Ch4: How to model multiple measures per cube?
Ap4.1: If the data have multiple measures, then it is common to publish cubes with multiple measures only when the measures are closely related to a single observational event (e.g., sensor network measurements). However, the approach to be followed is up to the data cube publisher. In case of modeling multiple measures in multiple cubes with one measure each, Ap2 (if the measures have one unit) and Ap3 (if the measures have multiple units) should be followed.
Ap4.2: In case of modeling multiple measures in one cube, the measure dimension approach (i.e., observations with a single measure) should be followed and the unit should be defined in each observation (see Ap3).

Dimension
Ch5: What rdf:Properties should be used for common dimensions?
Ap5.1: If a dimension refers to time, geography, or age, then a new qb:DimensionProperty should be defined. This new qb:DimensionProperty should also be defined as rdfs:subPropertyOf the corresponding SDMX dimension. For example, a geospatial dimension of a cube should be defined as a sub-property of sdmx-dimension:refArea.
Ap5.2: If a dimension refers to gender, then sdmx-dimension:sex should be reused, provided that the associated code list addresses the modeling needs, e.g., additional notions of sex such as hermaphroditism, transgender, and asexual are not needed. Otherwise, a new dimension should be defined along with a controlled vocabulary.

Table 2. Challenges and approaches

Dim. values
Ch6: How to associate a dimension to its values?
Ap6.1: The rdfs:range of a qb:DimensionProperty should always be defined.
Ap6.2: If a code list is modelled as skos:ConceptScheme, qb:HierarchicalCodeList, or skos:Collection, then it should be associated with the qb:DimensionProperty using the qb:codeList property. In addition, the object related to the rdfs:range property should be set to skos:Concept.

Common dimension values
Ch7.1: What values should be used in time related dimensions?
Ap7.1a: In the case of a specific point in time, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refPeriod and have rdfs:range xsd:dateTime.
Ap7.1b: In the case of a period of time, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refPeriod and have as rdfs:range the interval:Interval class of http://reference.data.gov.uk, which uses this class to define years. However, if the approach of http://reference.data.gov.uk is not sufficient, then new code lists can also be created and used.
Ch7.2: What values should be used in geospatial dimensions?
Ch7.3: What values should be used in age related dimensions?
Ap7.2: In the case of geography or age, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refArea or sdmx-dimension:age, respectively. The rdfs:range and/or qb:codeList of this dimension should be defined as described in Ap6.2. If a code list or reference dataset that addresses the modeling needs exists, then it should be re-used. Otherwise, a new code list should be created.

Single value
Ch8: How to model single value dimensions?
Ap8: A single value dimension should always be included in all observations of the cube.

Code list
Ch9.1: How to model a new code list?
Ap9.1: A code list should be modelled using SKOS. This is also suggested by the QB vocabulary. Specifically, individual code values should be modelled using skos:Concept and the overall set of values should be modelled using skos:ConceptScheme or skos:Collection. A separate code list should always be defined for each distinct set of values (e.g., age groups and geographical areas).
Ch9.2: How to model hierarchical structures in a code list?
Ap9.2: In the case of hierarchical data, hierarchical code lists should always be used to describe them. SKOS should be preferred when the hierarchies are simple. In cases where the hierarchical levels are fully separated and depth is a meaningful concept, XKOS is appropriate. Finally, when there is a need to express relations that are not covered by SKOS or XKOS (e.g., administeredBy in contrast to within), the QB vocabulary should be preferred.
Ch9.3: Should aggregate values be included as dimension values?
Ap9.3: Aggregate values (e.g., Total) should be included in a dimension if the measured variable in this dimension can be aggregated. The aggregate value should be modelled at the top of a hierarchy.

The association of a dimension to its potential values (Ch6) can be achieved using two complementary approaches: i) use the property rdfs:range to define the class of the values of a qb:DimensionProperty, and ii) use the property qb:codeList to associate a qb:DimensionProperty with a code list. The experts proposed to always use rdfs:range, and to use qb:codeList when a code list is available.

Some datasets describe a measure using only a single value of a dimension (Ch8), e.g., census data describe measures for a specific year. The QB vocabulary enables defining this single value at different levels: i) the qb:DataSet, ii) a qb:Slice, and iii) each qb:Observation.
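The three options can be sketched in Turtle as follows; all ex: URIs and values are hypothetical, and the snippet is a simplified illustration (e.g., slice structures are omitted):

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/def/> .

# (i) at the qb:DataSet: the fixed year is stated once for the
#     whole dataset
ex:census2011 a qb:DataSet ;
    ex:refPeriod "2011"^^xsd:gYear .

# (ii) at a qb:Slice: requires defining an extra slice resource
#      that groups the observations sharing the fixed year
ex:slice2011 a qb:Slice ;
    ex:refPeriod "2011"^^xsd:gYear ;
    qb:observation ex:obs1 .

# (iii) at each qb:Observation: more triples, but observations stay
#       self-contained
ex:obs1 a qb:Observation ;
    qb:dataSet ex:census2011 ;
    ex:refPeriod "2011"^^xsd:gYear ;
    ex:population 10816286 .
```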
The first does not enable the future addition of observations with a different value for that particular dimension, while the second imposes the extra burden of defining qb:Slices. The last approach was proposed by the experts, since it enables the addition of observations with different dimension values in the same dataset and the easy re-use of qb:Observations in another context. This approach has, however, the cost of an increased number of triples.

5 Open challenges

During the Delphi process the experts indicated a number of open challenges. These open challenges regard limitations of existing standards, lack of standards, and modeling decisions.

An important challenge is the definition of code lists for measures that could be re-used by LOSD publishers. Currently each publisher defines its own qb:MeasureProperty for the same measure (e.g., p1:unemployment and p2:unemployment). The definition of a "standard" code list would enable the publishing of LOSD in a uniform way, thus facilitating the integration and combination of related statistical data from different sources. Another challenge is related to the method by which a measure is calculated. For example, unemployment can be calculated based on different methods or different base periods. In this case there is a discussion on whether to use the same qb:MeasureProperty or not.

Composite measures may be derived from other measures [14, 15], e.g., "Unemployment Rate" as the ratio of the number of unemployed people to the total labour force. This relation should somehow be expressed by linking the two measures. In this case the computation of aggregated "total" values for composite measures would also be possible, for example the computation of the "total unemployment rate" based on the "male" and "female" unemployment rates. Currently, there is no LOSD standard to express these relations. Additionally, having explicitly defined which aggregation functions (e.g.
sum, average) are applicable to a measure is useful for further processing purposes. QB4OLAP [16] proposes an extension of the QB vocabulary to express aggregation functions. The applicability of aggregate functions to measures depends on various factors [17] (e.g., cube dimensions, units) and needs to be further explored.

A modeling challenge is related to the definition of multiple measures in a cube. Our study has shown that cubes with multiple measures should be published only when the measures are closely related to a single observational event (e.g., sensor measurements). If the measures are independent, then they should be modelled in separate cubes. However, there is a large grey area between the two, since the "observational event" is not clearly defined.

Finally, there are some challenges related to the performance of applications that consume LOSD. For instance, the use of qb:codeList indicates all the potential values of a qb:DimensionProperty. However, it is common for a cube not to use all the values; e.g., a code list may contain values for the geography of Europe, but the cube uses only values for Greece. In this case there is no way to retrieve only the used values from the cube structure. They can only be retrieved by demanding SPARQL queries that iterate over all the cube observations. The same also applies to units of measure.

6 Conclusion

A major part of open data concerns statistics. Recently many National Statistical Institutes and public authorities have adopted the linked data paradigm to publish LOSD, since it facilitates data integration on the Web. Towards this direction many standard vocabularies have been proposed (i.e., QB, SKOS, XKOS). The publication of high quality LOSD can be important primary material for added-value services, which can increase government transparency, contribute to economic growth, and provide social value. However, the creation of LOSD remains a demanding task because of modelling challenges.
These challenges are usually addressed by data publishers in an ad hoc manner, thus hindering the publishing of LOSD in a uniform way and leading to the creation of LOSD silos. As a result, LOSD from different sources cannot be easily integrated and generic software tools cannot be developed.

Towards this direction, experts who directly participate in the publishing of LOSD were involved through an iterative approach in order to comprehend the modeling challenges, identify relevant publishing approaches, and propose ways to address these challenges. The result is a set of proposed approaches that support LOSD publishers in modeling their data and applying common standards. However, a set of open challenges related to the limitations of existing standards, the lack of standards, and modeling decisions still remain to be explored.

We anticipate that the analysis of the modelling challenges as well as the proposed approaches presented in this paper will trigger and contribute towards a discussion on the development of best practices for publishing LOSD, facilitating the combining and analysing of linked statistical data from diverse sources.

Acknowledgement

This paper is an extended abstract of [18], published in the Journal of Web Semantics. Part of this work was funded by the European Commission within the H2020 Programme in the context of the project OpenGovIntelligence under grant agreement no. 693849. The authors would like to cordially thank all the experts who participated in the study.

References

1. E. Kalampokis, E. Tambouris, K. Tarabanis, A classification scheme for open government data: towards linking decentralised data, Int. J. Web Eng. Technol. 6 (3) (2011) 266–285.
2. S. Capadisli, S. Auer, A.-C. Ngonga Ngomo, Linked SDMX data, Semantic Web 6 (2).
3. E. Kalampokis, E. Tambouris, A. Karamanou, K. Tarabanis, Open statistics: The rise of a new era for open data?, in: H. J. Scholl, O. Glassey, M. Janssen, B. Klievink, I. Lindgren, P. Parycek, E. Tambouris, M. A.
Wimmer, T. Janowski, D. Sá Soares (Eds.), Electronic Government, Springer International Publishing, Cham, 2016, pp. 31–43.
4. C. Bizer, T. Heath, T. Berners-Lee, Linked data: the story so far, Semantic Services, Interoperability and Web Applications: Emerging Concepts (2009) 205–227.
5. E. Kalampokis, E. Tambouris, K. Tarabanis, Linked open cube analytics systems: Potential and challenges, IEEE Intelligent Systems 31 (5) (2016) 89–92.
6. C.-C. Hsu, B. A. Sandford, The Delphi technique: making sense of consensus, Practical Assessment, Research & Evaluation 12 (10) (2007) 1–8.
7. R. L. Custer, J. A. Scarcella, B. R. Stewart, The modified Delphi technique: A rotational modification, Journal of Vocational and Technical Education 15 (2) (1999) 1–10.
8. R. Cyganiak, M. Hausenblas, E. McCuirc, Official statistics and the practice of data fidelity (2011) 135–151.
9. F. S. Tseng, C.-W. Chen, Integrating heterogeneous data warehouses using XML technologies, Journal of Information Science 31 (3) (2005) 209–229. doi:10.1177/0165551505052467.
10. S. Berger, M. Schrefl, From federated databases to a federated data warehouse system, in: Proceedings of the 41st Annual Hawaii International Conference on System Sciences, IEEE, 2008, pp. 394–394.
11. R. Cyganiak, D. Reynolds, The RDF data cube vocabulary: W3C recommendation, 2014.
12. A. Miles, S. Bechhofer, SKOS Simple Knowledge Organization System reference: W3C recommendation, 2009.
13. R. Cyganiak, D. Gillman, R. Grim, Y. Jaques, W. Thomas, XKOS: A SKOS extension for representing statistical classifications, 2017.
14. S. F. Pileggi, J. Hunter, An ontological approach to dynamic fine-grained urban indicators, Procedia Computer Science 108 (2017) 2059–2068, International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland.
15. M. Denk, W. Grossmann, Towards a best practice of modeling unit of measure and related statistical metadata, IMF working paper.
16. J. Varga, A. A. Vaisman, O.
Romero, L. Etcheverry, T. B. Pedersen, C. Thomsen, Dimensional enrichment of statistical linked open data, Journal of Web Semantics 40 (2016) 22–51. doi:10.1016/j.websem.2016.07.003.
17. E. Kalampokis, E. Tambouris, K. Tarabanis, ICT tools for creating, expanding, and exploiting statistical linked open data, Statistical Journal of the IAOS 33 (2) (2017) 503–514.
18. E. Kalampokis, D. Zeginis, K. Tarabanis, On modeling linked open statistical data, Journal of Web Semantics 55 (2019) 56–68. doi:10.1016/j.websem.2018.11.002.