<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Statistical Data for Machine Learning Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evangelos Kalampokis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Areti Karamanou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Tarabanis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Macedonia</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine learning represents a pragmatic breakthrough in making predictions by finding complex structures and patterns in large volumes of data. Open Statistical Data (OSD), which are highly structured and generally of high quality, can be used in advanced decision-making scenarios that involve machine learning analysis. Linked data technologies facilitate the discovery, retrieval, and combination of data on the Web, and in this way enable the wide exploitation of OSD in machine learning. A challenge in such analyses is to specify the criteria for selecting the proper datasets to combine and construct a predictive model. This paper presents a case study that aims at creating a model to predict house sales prices in fine-grained geographical areas in Scotland using a large variety of Linked Open Statistical Data (LOSD) from the Scottish official statistics portal. To this end, we present the machine learning analysis steps that can be enhanced using LOSD and we define a set of compatibility criteria. A software tool is also presented as a proof of concept for facilitating the exploitation of LOSD in machine learning. The case study demonstrates the importance of discovering and combining compatible datasets when implementing machine learning scenarios for decision-making.</p>
      </abstract>
      <kwd-group>
        <kwd>statistical data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Opening up data for others to reuse is a priority in many countries around the
globe. Although the global annual economic potential of open data is estimated
at $3 trillion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], this potential remains largely unrealized. This is
explained by a number of barriers that hamper the implementation of
sophisticated solutions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] at the institutional level (e.g. the task complexity of handling
data, legislation, information quality) and at the technical level [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        A promising path to overcome open data barriers is to focus on numerical
data and, more specifically, on statistics [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Open Statistical Data (OSD)
constitute a large part of open data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Their added value is related to the fact that
they are highly structured and hence can be easily processed. Moreover, they
describe financial, social, and political aspects of the world, and thus constitute
a major element in economic and social decision-making [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        However, OSD are barely used in advanced decision-making scenarios that
involve machine learning analysis. Machine learning represents a pragmatic
breakthrough in making predictions by finding complex structures and patterns in
large volumes of data. Recent examples indicating the potential of applying
machine learning to statistical data to support decision making include the
identification of important factors related to bicycle crashes [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the analysis of
consumption patterns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the prediction of crime from both demographic and mobile data
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the definition of consumer profiles using internal company and statistical data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This difficulty of using statistical data in advanced machine learning scenarios
can be explained, among others, by the fragmented environment of OSD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. OSD
are usually provided by Web portals as downloadable files (e.g. CSV, JSON) or
through specialized APIs. In the first case, data about an indicator are provided
through hundreds, even thousands, of different files. For example, searching for
"unemployment" in the UK's official open data portal returns more than 2,000
relevant files [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In the latter case, existing APIs do not address requirements
regarding the combination of data from multiple datasets or sources [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. As
a result, combining statistical datasets in order to involve them in advanced
machine learning analysis remains a difficult task.
      </p>
      <p>
        Linked data technologies facilitate discovering, retrieving, and combining
data on the Web by semantically annotating data, creating links between them,
and enabling access to them through the SPARQL query language. Linked data
technologies have recently become W3C standards [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Indeed, in recent years many
National Statistics Institutes and governments have created Web portals
providing Linked Open Statistical Data (LOSD). Examples include the UK's Office
for National Statistics1 and the Scottish Government2. Early research in this
area contributed towards this direction (e.g. [
        <xref ref-type="bibr" rid="ref10 ref12 ref16 ref9">12,9,10,16</xref>
        ]). All LOSD portals use
standard Web technologies (e.g. HTTP, RDF, URIs) and vocabularies (e.g. RDF
Data Cube, SKOS, XKOS).
      </p>
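As a concrete illustration of how LOSD can be accessed, the sketch below builds a SPARQL query over observations modelled with the W3C RDF Data Cube vocabulary. The `qb:` and `sdmx:` prefixes are the standard vocabulary terms; the endpoint URL and the dataset URI are assumptions for illustration and would need to be checked against the actual portal.

```python
# Sketch: a SPARQL query for observations in an RDF Data Cube dataset.
# The endpoint and dataset URI below are illustrative assumptions, not
# verified identifiers from the Scottish portal.
ENDPOINT = "http://statistics.gov.scot/sparql"  # assumed endpoint location

query = """
PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx/2009/dimension#>

SELECT ?obs ?area ?period ?value
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet <http://statistics.gov.scot/data/house-sales-prices> ;
       sdmx:refArea ?area ;
       sdmx:refPeriod ?period ;
       ?measure ?value .
}
LIMIT 10
"""

print(query)
```

The query could then be submitted to the endpoint with any SPARQL client; only the query string itself is constructed here.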
      <p>The large volume and variety of datasets provided by LOSD portals are
necessary in sophisticated machine learning scenarios in order to create predictive
models. A challenge in such scenarios is to specify the criteria that should be
considered when selecting the datasets to use in order to solve a specific problem.</p>
      <p>The aim of this paper is to present a case study that combines LOSD in order
to perform machine learning analysis and support advanced decision-making.
Towards this end, we first specify the criteria that define which datasets can be
used to solve a problem using machine learning. The datasets of our case study
are selected based on these criteria. We also present the Compatible LOSD
Selection tool, a proof of concept of the case study that facilitates the selection
of datasets that will be combined for machine learning analysis.</p>
      <p>The rest of the paper is organised as follows: Section 2 presents the method
of this paper. Section 3 defines the compatibility criteria. Section 4 presents the
case study and its results. Section 5 presents the Compatible LOSD Selection
tool. Finally, Section 6 concludes and discusses the results.</p>
    </sec>
    <sec id="sec-2">
      <title>1 http://statistics.data.gov.uk</title>
    </sec>
    <sec id="sec-3">
      <title>2 http://statistics.gov.scot</title>
      <sec id="sec-3-1">
        <title>Method</title>
        <p>
          The method used in the case study includes four steps:
1. Problem definition. The problem definition step enables users to define the
problem they are interested in solving using machine learning analysis. To this
end, the response variable of the predictive model is defined (including
geographical boundaries, time constraints, units of measure etc.). This requires
exploring the metadata of the available datasets. Moreover, the type of the
problem is specified (e.g. regression, classification etc.). For example, a problem
could be to predict the 2012 house prices in the 2001 data zones of Scotland.
2. Data selection. The data selection step selects the datasets that will be
combined with the response variable and contribute towards solving the problem
defined in the previous step. The selection of the datasets uses five structural
criteria based on the granularity of the geographical dimension, the temporal
dimension, the unit of the measure, the type of the measure, and additional
dimensions.
3. Feature extraction. This step extracts from the datasets selected in the
previous step numerous features, also known as predictors. Features are extracted from the
combination of different dimensions and measures in one or more datasets.
Dimensions determine and explain a feature. For example, an
unemployment dataset with four dimensions, i.e. age group (15-25, 25-54, 55-64), type
of unemployment (cyclical, frictional, structural), measure type (count,
ratio), and reference period (2001-Q1, ..., 2016-Q4), could result in 3 x 3 x 2 x 64
= 1,152 features.
4. Feature selection and model creation. The feature selection step selects among
all extracted features the ones that will be used to construct the predictive
model. Those are the features that are significantly correlated with the response
variable. Features considered redundant or irrelevant are ignored. Machine
learning methods to select features include Lasso (Least Absolute Shrinkage and
Selection Operator) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], stepwise selection, and tree boosting. For our
case study we use the Lasso method to select features. In addition, in order
to assess the result of the machine learning method used to select features,
criteria such as the Mean Squared Error (MSE), which measures the average of
the squares of the errors (i.e. the differences between the actual and the
predicted values), and the misclassification error are commonly used. In our
case study we use the Root Mean Squared Error (RMSE) to assess the result of
Lasso.
        </p>
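The feature count in the unemployment example of step 3 is simply the product of the dimension cardinalities, which can be checked directly; the cardinalities below are the ones stated in the text.

```python
# Feature count for the illustrative unemployment dataset in step 3:
# one feature per combination of fixed dimension values.
age_groups = 3          # 15-25, 25-54, 55-64
unemployment_types = 3  # cyclical, frictional, structural
measure_types = 2       # count, ratio
quarters = 16 * 4       # 2001-Q1 ... 2016-Q4 = 64 quarters

n_features = age_groups * unemployment_types * measure_types * quarters
print(n_features)  # 1152
```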
        <p>LOSD contribute to the second step of the methodology by facilitating the
selection of datasets that can be combined with the response variable in order to
construct the predictive model. The next section specifies the criteria to consider
in order to select compatible LOSD that can contribute to a predictive model as
a response variable or as a feature.</p>
        <p>
          Combining statistical datasets for machine learning analysis
In general, statistical data are aggregated data that describe a measured fact (e.g.
house prices) at specific geographical points (e.g. a country, city or building) and
in a specific period of time (e.g. a year, month, week). In this sense, statistical
data can be represented as a data cube, where each cell contains a measure or a set
of measures, and thus we can refer to statistical data as data cubes or just cubes
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The geographical point and the period of time that describe a measure are
called dimensions (geographical and temporal respectively). A statistical dataset
can be described by additional dimensions as well, such as age, gender etc. It is
frequently useful to create a subset of a statistical dataset. This subset fixes all
but one (or a small subset) of the initial dataset's dimensions and is called a
slice through the dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
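The notion of a slice described above can be sketched on a tabular representation of a tiny cube: all dimensions except one are fixed, leaving one open dimension. The data values and dimension names below are made up for illustration.

```python
import pandas as pd

# A tiny data cube in tabular form with three dimensions (area, year,
# gender) and one measure column. A slice fixes all dimensions but one;
# here `area` stays open while year=2012 and gender="All" are fixed.
cube = pd.DataFrame({
    "area":   ["DZ01", "DZ01", "DZ02", "DZ02"],
    "year":   [2011, 2012, 2011, 2012],
    "gender": ["All", "All", "All", "All"],
    "value":  [100, 110, 90, 95],
})

slice_2012 = cube[(cube["year"] == 2012) & (cube["gender"] == "All")]
print(slice_2012[["area", "value"]])  # one row per open-dimension value
```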
        <p>The second step of our methodology requires selecting the slices of statistical
datasets that will contribute as the response variable (also called Y) and as
the features (also called Xs) of the defined problem based on:</p>
        <sec id="sec-3-1-1">
          <title>1. The granularity of the geographical dimension.</title>
          <p>2. The temporal dimension.
3. The unit of the measure.
4. The type of the measure.
5. Additional dimensions.</p>
          <p>We specify the above criteria separately for the response variable and the
features. In particular, the selection of the slice that will be used for the response
variable is based on:
1. The granularity of the geographical dimension. Commonly the defined
problem focuses on geographical points with a specific granularity level (e.g. to
predict the house prices in the 2001 data zones of Scotland). As a result, the
slice selected for the response variable should use this specific granularity
level. This will be the open dimension of the slice.
2. The temporal dimension. The defined problem focuses on a specific period of
time (e.g. to predict the 2012 house prices in the 2001 data zones of Scotland).
As a result, the slice selected for the response variable should have the time
dimension fixed to the selected period of time.
3. The unit of the measure. Datasets usually use a unit to describe their
measure. Common units of measure are ratio and count. Depending on the
problem, slices using ratio or count should be selected. If the selected dataset
includes more than one unit of measure, the unit dimension should be fixed
to the preferred unit of measure.
4. The type of the measure. The measure of a statistical dataset may be
categorical or continuous. Continuous measures contain numbers with an infinite
number of values between any two values. Categorical measures contain a
finite number of categories or distinct groups. The nature of the defined
problem will specify the type of the measure to be selected for the slice of
the response variable.
5. Additional dimensions. Additional dimensions in the selected slice are
desirable (but optional) as they increase the number of extracted features
that could be used in the construction of more reliable predictive models.
A common additional dimension is, for example, gender. Additional
dimensions should also be fixed to a specific value.</p>
          <p>In addition, the selection of the slices for the features is based on:
1. The granularity of the geographical dimension. The slices selected for the X
variables of the predictive model should have the same granularity level as
the slice of Y. As a result, only datasets that have the same granularity level
in the geographical dimension as the Y variable should be selected.
2. The temporal dimension. Machine learning usually aims to predict a specific
phenomenon based on historical data. As a result, slices selected for the X
variables should refer to the same or past years relative to the Y variable.
3. The unit of the measure. Slices using ratio are preferably selected over count
because ratio values are normalized. However, slices with count measures
can also be selected provided that they will be combined with other count
measures in the next step of the methodology (namely feature extraction) in
order to construct new ratio variables. For example, one could select a slice
counting the number of births and also a slice counting the number of deaths
from the data portal of Scotland in order to create, in the feature extraction
step, the ratio `number of births/number of deaths'.
4. The type of the measure. In a predictive model it is not mandatory for the
Y and X variables to have the same type. As a result, when the Y variable
is categorical the data selection step can select slices with either categorical
or continuous measures for the features, and vice versa.
5. Additional dimensions. The selected slice can also have additional
dimensions (e.g. gender) with the same or different values relative to the respective
dimension of Y.</p>
          <p>Case Study: Predicting the House Prices in Scotland
The case study presented in this paper uses datasets from the official statistics
data portal of Scotland, i.e. http://statistics.gov.scot, which was launched
in August 2016. At the time of writing the portal provides access to 220
statistical datasets about Scotland. The datasets can be viewed in various formats,
including tables, maps and charts, or downloaded in formats such as CSV and
N-triples. The datasets can be browsed by theme (e.g. Labour Force,
Environment, Transport etc.) or by the organisation that published the dataset (e.g.
Scottish Government, SEPA or Transport Scotland).</p>
          <p>Scottish official statistics are also provided in Linked Data format using the
W3C's RDF Data Cube Vocabulary3, which allows modelling statistical data
as data cubes. In particular, each dataset in the portal is modelled as a data</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 https://www.w3.org/TR/vocab-data-cube/</title>
      <p>cube. Each data cube provides multiple ancillary dimensions complementing
the indicator, which is the measure of the data cube. The two most common
dimensions used to describe the datasets are the geographical dimension, called
Reference Area, and the temporal dimension, called Reference Period. The
geographical dimension of the datasets is based on a hierarchy of administrative
or census-based areas ranging from Scottish data zones to electoral wards
and countries. Granularity refers to the levels of depth of the reference area
dimension of each dataset. Some examples of these levels include country, council
areas, electoral wards, and data zones. For example, the house sales dataset4
describes the number of residential property transactions recorded at different
geographical levels of Scotland (e.g. countries, electoral wards, 2001 data
zones and others) in different reference periods (1993-2017). Other commonly
used dimensions include the gender and age group of the population.
Problem definition The objective of the case study presented in this paper is
to predict the 2012 mean house prices in the 2001 data zones of Scotland. This
is a description of the response variable of our problem.</p>
      <p>2001 Data zones were introduced in 2004 and are the smallest geographical
granularity level in Scotland. They have populations between 500 and 1,000
household residents. Selecting 2001 data zones for our response variable
results in a great number of observations, which helps to avoid the curse of
high-dimensionality (i.e. having fewer observations than features), create a
more robust model, and predict prices for a specific district or neighbourhood.</p>
      <p>Regarding the machine learning method, the Lasso regression method is
selected to solve the problem described above. Lasso yields sparse models, i.e.
models that involve only a subset of the variables.</p>
      <p>Data selection After the definition of our problem we first search for datasets
in the Scottish data portal that can contribute as the response variable of our
problem. The slice selected for the response variable comes from the dataset
House prices of the Scottish portal5, with the temporal dimension fixed to 2012,
the measure type fixed to mean, and the values of the reference area dimension
coming from the 2001 data zones in Scotland.</p>
      <p>We then search for datasets that can contribute as features. The selected
datasets should be compatible with the response variable. For this reason we
search in the Scottish data portal for datasets based on the compatibility criteria
described in the previous section. In particular, we search for datasets that:
1. The granularity level in their geographical dimension is Scottish 2001 data
zones
2. Their temporal dimension refers to a year in the range 2009-2012</p>
      <p>4 http://statistics.gov.scot/resource?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fhouse-sales
5 http://statistics.gov.scot/resource?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fhouse-sales-prices</p>
      <sec id="sec-4-1">
        <title>3. Their unit of measure is ratio</title>
        <p>4. Have a continuous or categorical measure
5. (Optionally) have additional dimensions</p>
        <p>It should be noted that in this case study we only searched for datasets with
a ratio unit of measure. However, as also described in Section 3, count datasets
could also be selected provided that they are transformed into ratio values in
the next step.</p>
        <p>In addition, some datasets may be trivially correlated with the response variable
of our case study, i.e. the house sales prices, and should be excluded. For example,
the Council Tax Bands dataset provides the rate of houses that belong to a
specific council tax band in each Scottish data zone. This measure is, however,
actually derived from the price of houses, and for this reason we shouldn't include
it in our case study. In reality, Council Tax Bands is a discrete measure, which
means it aggregates the number of houses according to their value.</p>
        <p>The exploration of the Scottish data portal for datasets that satisfy the above
criteria results in the selection of 21 compatible datasets.</p>
        <p>Feature extraction In this step we extract multiple features from each selected
compatible dataset. In our case study each feature is extracted from only one
dataset. For instance, the dimensions of the "Age of First Time Mothers" dataset,
which describes the rate of first-time mothers, include reference period and age.
For the age dimension three values are used: (1) 19 and under, (2) 35 and
over, and (3) All. If we also consider that we have selected 4 reference periods
for our datasets (i.e. 2009, 2010, 2011 and 2012), the final number of features
that can be extracted from this dataset is calculated as:</p>
        <p>2 (the two values of the age dimension - "All" is not included) x 1 (the number
of the different unit types) x 4 (number of reference periods) = 8 features.</p>
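The enumeration above can be reproduced as a Cartesian product over the fixed dimension values; the feature-name format is a hypothetical convention for illustration.

```python
from itertools import product

# Features from the "Age of First Time Mothers" dataset: one feature per
# combination of age value ("All" excluded), unit type, and reference period.
ages = ["19 and under", "35 and over"]  # "All" is not included
units = ["ratio"]
years = [2009, 2010, 2011, 2012]

# Hypothetical naming convention: dataset|age|unit|year
features = [f"first_time_mothers|{a}|{u}|{y}"
            for a, u, y in product(ages, units, years)]
print(len(features))  # 8
```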
        <p>The same applies to the rest of the selected datasets in order to extract all
features. The feature extraction step results in 450 features.</p>
        <p>Feature selection and model creation In order to eliminate insignificant
features we use the regression analysis method called Lasso. The Lasso
implementation uses the glmnet library6. Lasso keeps only the important
features (i.e. features which add value to our estimations) and removes the rest.
A reduced number of features facilitates the interpretation of the results.</p>
        <p>In our case study Lasso results in 34 features coming from 10 datasets. The
initial number of features (i.e. 450) is hence significantly reduced (by more than
92%). The 10 datasets selected are:</p>
      </sec>
      <sec id="sec-4-2">
        <p>1. Age of First Time Mothers
2. Ante-Natal Smoking
3. Breastfeeding
4. Disability Living Allowance
5. Dwellings by Number of Rooms
6. Employment and Support Allowance
7. Hospital Admissions
8. Household Estimates
9. Income And Poverty Modelled Estimates
10. Job Seeker's Allowance Claimants</p>
        <p>6 https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html</p>
        <p>Table 1 presents the detailed results of the application of the Lasso analysis
method. We can see that there is no significant difference between the Lasso
with the lowest RMSE and the Lasso within one standard error.</p>
        <p>Lasso uses the cross-validation method to separate the data into training and
test data and make the prediction. Cross-validation divides the initial dataset
into a number of roughly equal parts (aka folds). In each round of the
cross-validation, each fold in turn is used as test data and the rest of the folds as
training data. We use the Root Mean Squared Error (RMSE) of the log error to
assess the result of Lasso. The log error is the log of the predicted value minus
the log of the actual value.</p>
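The RMSE of the log error as defined above can be computed directly; the price vectors below are made-up numbers for illustration.

```python
import numpy as np

# RMSE of the log error: log(predicted) - log(actual), squared, averaged,
# then square-rooted. The values are illustrative, not case-study data.
actual = np.array([100000.0, 150000.0, 80000.0])
predicted = np.array([110000.0, 140000.0, 90000.0])

log_error = np.log(predicted) - np.log(actual)
rmse = np.sqrt(np.mean(log_error ** 2))
print(round(rmse, 4))
```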
        <p>In our case study we randomly select the folds used as test and training data
(using a seed). This means that each time someone repeats the same procedure
with the same datasets, they will obtain different training and test data and,
hence, a different RMSE. We repeat the same Lasso analysis using the same
datasets 100 times (which is also the default value for the glmnet library) in order
to see the variance of the RMSEs over the multiple repetitions. Fig. 1 presents two
boxplots. The left boxplot illustrates the variance of the RMSE calculated in
all 100 repeated Lasso experiments. We can see that the median of the RMSEs
is close to 0.273 and that the distance between the median and the lower and
upper quartiles is limited. The right boxplot presents the variation of the total
number of selected features based on one SE. We can see that the median
number of features is 28, which is also the lower quartile.
Cross-validation allows selecting the best value for the tuning parameter
(lambda), or equivalently, the value of the constraints. To this end, we compute
the lambda parameter. The lambda parameter controls the amount of regularization,
so choosing a good value for it is crucial. In cases with a very large number of
features, Lasso allows to efficiently find the model that involves a small subset of
the features. The value selected for lambda is the one that corresponds to the
smallest error or the value within one standard error. The plot in Fig. 2 shows
how the RMSE fluctuates for different values of lambda (or numbers of features). Higher
values of lambda produce less flexible functions and, hence, higher errors, while
lower values of lambda produce more flexible functions and, hence, lower errors.
We select the optimal value for lambda as the one that corresponds to the minimum
RMSE, i.e. log(lambda) = -4. Following this rule, the final number of selected features is 34.</p>
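The case study runs this workflow with glmnet in R; as a rough Python analogue under stated assumptions, the sketch below uses scikit-learn's cross-validated Lasso, which likewise picks the regularization strength (alpha here, lambda in glmnet) minimizing the cross-validation error and keeps only features with non-zero coefficients. The data are synthetic; in the case study, X would hold the 450 extracted features and y the 2012 mean house prices per data zone.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: 200 observations, 50 candidate features, of which only
# features 0 and 1 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 10-fold cross-validation over a path of alpha (lambda) values; the model
# refits at the alpha with the lowest mean CV error.
model = LassoCV(cv=10, random_state=0).fit(X, y)

# Feature selection: indices of non-zero coefficients.
selected = np.flatnonzero(model.coef_)
print(len(selected))  # far fewer than the 50 candidates
```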
        <sec id="sec-4-2-1">
          <title>The Compatible LOSD Selection tool</title>
          <p>We develop an open source tool as a proof of concept of the case study. The tool
offers an interface that facilitates the selection of compatible statistical datasets
that can be used for machine learning analysis. The tool is based on R Shiny7 and
obtains statistical datasets from the Scottish data portal. It allows selecting a
dataset from the Scottish portal, and searches for and presents compatible datasets
based on the defined compatibility criteria. The tool is available on GitHub8.</p>
          <p>Fig. 3 presents a screenshot of the Compatible LOSD Selection tool. On the
left panel, 2012 house prices has been selected as the first dataset. On the right
panel, the 20 compatible datasets are presented. The selected compatible datasets
can be extracted to contribute to the creation of a predictive model.</p>
          <p>Although governments and other organisations are continuously opening up their
statistical data, the potential of open data has remained unrealized to a large extent
due to institutional and technical barriers. In machine learning analyses, linked
data facilitate the discovery, retrieval, and combination of data on the Web.
However, a challenge in such analyses is to specify the criteria to be considered
in order to select the proper datasets to construct the predictive model.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7 https://shiny.rstudio.com/</title>
    </sec>
    <sec id="sec-6">
      <title>8 https://github.com/akaramanou/compatible-LOSD-selection-tool</title>
      <p>In this paper we presented a case study that applied machine learning
methods to compatible statistical datasets from the Scottish data portal in order to
support advanced decision-making scenarios. The case study aimed to predict the
house prices in Scotland. To facilitate the discovery of compatible datasets we
defined five compatibility criteria. Based on the criteria we discovered 21 datasets
compatible with the response variable. From these datasets we extracted 450
features and applied the Lasso method in order to select the most important
features. This resulted in 34 features coming from only 10 datasets (over 92%
fewer features than the ones initially identified). This means that there is a strong
relationship between the house prices in Scotland and these 10 datasets. We
also developed the Compatible LOSD Selection tool, which facilitates discovering
compatible LOSD datasets to perform machine learning analysis.</p>
      <p>This case study is indicative of the importance of using machine learning to
analyse statistical datasets and support decision making. Starting from a
problem that needed to be solved, we ended up identifying relationships between
datasets, some of them previously unknown. For example, our case study
revealed a strong relationship between the breastfeeding percentage and the mean
house prices in Scotland. Other relationships were more obvious, such as the one
between Income and Poverty estimates and mean house prices. Eliminating all
irrelevant datasets in an easy way can be really beneficial for decision makers, as
it saves them time dealing with unessential data and helps them understand
which variables matter most and which can be ignored. More importantly, this
case study shows that decision makers can now easily exploit historical statistical
data using machine learning in order to take evidence-based decisions.</p>
      <p>The case study also showed that, when it comes to statistical data, effectively
discovering compatible datasets is crucial for creating successful
predictive models. The compatibility criteria we defined are only a first attempt
to define the compatibility between statistical datasets. However, this first
attempt showed that discovering compatible datasets forms the basis for extracting
meaningful results using machine learning.</p>
      <p>Acknowledgments. This research is co-financed by Greece and the European
Union (European Social Fund - ESF) through the Operational Program "Human
Resources Development, Education and Lifelong Learning 2014-2020" in the
context of the project "Integrating open statistical data using semantic technologies"
(MIS 5007306).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Bogomolov, A., Lepri, B., Staiano, J., Oliver, N., Pianesi, F., Pentland, A.: Once upon a crime: towards crime prediction from demographics and mobile data. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 427-434. ACM (2014)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Coleman</surname>
            ,
            <given-names>S.Y.</given-names>
          </string-name>
          :
          <article-title>Data-mining opportunities for small and medium enterprises with official statistics in the UK</article-title>
          .
          <source>Journal of Official Statistics</source>
          <volume>32</volume>
          (
          <issue>4</issue>
          ),
          <fpage>849</fpage>
          –
          <lpage>865</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The RDF Data Cube Vocabulary</article-title>
          .
          <source>W3C Recommendation</source>
          ,
          <source>W3C</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The cube data model: a conceptual model and algebra for on-line analytical processing in data warehouses</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>27</volume>
          (
          <issue>3</issue>
          ),
          <fpage>289</fpage>
          –
          <lpage>301</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Degirmenci</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Ozbak r, L.:
          <article-title>Di erentiating households to analyze consumption patterns: a data mining study on o cial household budget data</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ) (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <collab>European Commission</collab>
          :
          <article-title>Guidelines on recommended standard licences, datasets and charging for the reuse of documents</article-title>
          (
          <year>2014</year>
          ).
          <source>C240/1</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hassani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saporta</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          :
          <article-title>Data mining and official statistics: the past, the present and the future</article-title>
          .
          <source>Big Data</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>34</fpage>
          –
          <lpage>43</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charalabidis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuiderwijk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Benefits, adoption barriers and myths of open data and open government</article-title>
          .
          <source>Information Systems Management</source>
          <volume>29</volume>
          (
          <issue>4</issue>
          ),
          <fpage>258</fpage>
          –
          <lpage>268</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stasiewicz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karamanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zotou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeginis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>Exploiting linked data cubes with OpenCube Toolkit</article-title>
          .
          <source>In: International Semantic Web Conference (Posters &amp; Demos)</source>
          , vol.
          <volume>1272</volume>
          , pp.
          <fpage>137</fpage>
          –
          <lpage>140</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karamanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>Challenges on developing tools for exploiting linked open data cubes</article-title>
          .
          <source>In: SemStats@ISWC</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karamanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Open statistics: The rise of a new era for open data?</article-title>
          .
          <source>In: International Conference on Electronic Government and the Information Systems Perspective</source>
          , pp.
          <fpage>31</fpage>
          –
          <lpage>43</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Linked open government data analytics</article-title>
          .
          <source>In: International Conference on Electronic Government</source>
          , pp.
          <fpage>99</fpage>
          –
          <lpage>110</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Linked open cube analytics systems: Potential and challenges</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>31</volume>
          (
          <issue>5</issue>
          ),
          <fpage>89</fpage>
          –
          <lpage>92</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Manyika</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groves</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farrell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Kuiken</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doshi</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Open data: Unlocking innovation and performance with liquid information</article-title>
          .
          <source>McKinsey Global Institute</source>
          <volume>21</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Prati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietrantoni</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraboni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Using data mining techniques to predict the severity of bicycle crashes</article-title>
          .
          <source>Accident Analysis &amp; Prevention</source>
          <volume>101</volume>
          ,
          <fpage>44</fpage>
          –
          <lpage>54</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Processing linked open data cubes</article-title>
          .
          <source>In: International Conference on Electronic Government</source>
          , pp.
          <fpage>130</fpage>
          –
          <lpage>143</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          .
          <source>Journal of the Royal Statistical Society, Series B (Methodological)</source>
          , pp.
          <fpage>267</fpage>
          –
          <lpage>288</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <collab>W3C</collab>
          :
          <article-title>Data on the Web Best Practices</article-title>
          (
          <year>2017</year>
          ). URL https://www.w3.org/TR/dwbp/. W3C Recommendation
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zeginis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moynihan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Facilitating the exploitation of linked open statistical data: JSON-QB API requirements and design criteria</article-title>
          .
          <source>In: 5th International Workshop on Semantic Statistics (SemStats2017), co-located with the 16th International Semantic Web Conference (ISWC2017)</source>
          , vol.
          <volume>1923</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kindarto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A garbage can model of government IT project failures in developing countries: The effects of leadership, decision structure and team competence</article-title>
          .
          <source>Government Information Quarterly</source>
          <volume>33</volume>
          (
          <issue>4</issue>
          ),
          <fpage>629</fpage>
          –
          <lpage>637</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>