Combining Statistical Data for Machine Learning Analysis

Evangelos Kalampokis, Areti Karamanou, and Konstantinos Tarabanis
University of Macedonia, Thessaloniki, Greece
{ekal,akarm,kat}@uom.edu.gr

Abstract. Machine learning represents a pragmatic breakthrough in making predictions by finding complex structures and patterns in large volumes of data. Open Statistical Data (OSD), which are highly structured and generally of high quality, can be used in advanced decision-making scenarios that involve machine learning analysis. Linked data technologies facilitate the discovery, retrieval, and combination of data on the Web. In this way, they enable the wide exploitation of OSD in machine learning. A challenge in such analyses is to specify the criteria for selecting the proper datasets to combine and construct a predictive model. This paper presents a case study that aims at creating a model to predict house sales prices in fine-grained geographical areas in Scotland using a large variety of Linked Open Statistical Data (LOSD) from the Scottish official statistics portal. To this end, we present the machine learning analysis steps that can be enhanced using LOSD and we define a set of compatibility criteria. A software tool is also presented as a proof of concept for facilitating the exploitation of LOSD in machine learning. The case study demonstrates the importance of discovering and combining compatible datasets when implementing machine learning scenarios for decision-making.

Keywords: statistical data · machine learning · compatibility.

1 Introduction

Opening up data for others to reuse is a priority in many countries around the globe. Although the global annual economic potential of open data is estimated at $3 trillion [14], this potential has to a large extent remained unrealised. This is explained by a number of barriers that hamper the implementation of sophisticated solutions [20] at the institutional level (e.g. the complexity of handling data, legislation, information quality) and at the technical level [8].

A promising path to overcome open data barriers is to focus on numerical data and, more specifically, statistics [11]. Open Statistical Data (OSD) constitute a large part of open data [6]. Their added value is related to the fact that they are highly structured and hence can be easily processed. Moreover, they describe financial, social, and political aspects of the world, and thus play a crucial role in economic and social decision-making [7].

However, OSD are barely used in advanced decision-making scenarios that involve machine learning analysis. Machine learning represents a pragmatic breakthrough in making predictions by finding complex structures and patterns in large volumes of data. Recent examples indicating the potential of applying machine learning to statistical data to support decision making include the identification of important factors related to bicycle crashes [15], the analysis of consumption patterns [5], the prediction of crime from both demographic and mobile data [1], and the definition of consumer profiles using internal company and statistical data [2]. This difficulty of using statistical data in advanced machine learning scenarios can be explained, among others, by the fragmented environment of OSD [7].
OSD are usually provided through Web portals as downloadable files (e.g. CSV, JSON) or through specialised APIs. In the first case, data about an indicator are provided through hundreds, even thousands, of different files. For example, searching for “unemployment” in the UK's official open data portal returns more than 2,000 relevant files [13]. In the latter case, existing APIs do not address requirements regarding the combination of data from multiple datasets or sources [19]. As a result, combining statistical datasets in order to involve them in advanced machine learning analysis remains a difficult task.

Linked data technologies facilitate discovering, retrieving, and combining data on the Web by semantically annotating data, creating links between them, and enabling access to them using the SPARQL query language. Linked data have recently become a W3C standard [18]. Indeed, during the last years many National Statistics Institutes and governments have created Web portals providing Linked Open Statistical Data (LOSD). Examples include the UK's Office for National Statistics (http://statistics.data.gov.uk) and the Scottish Government (http://statistics.gov.scot). Early research in this area contributed towards this direction (e.g. [12,9,10,16]). All LOSD portals use standard Web technologies (e.g. HTTP, RDF, URIs) and vocabularies (e.g. the RDF Data Cube vocabulary, SKOS, XKOS).

The large volume and variety of datasets provided by LOSD portals are necessary in sophisticated machine learning scenarios in order to create predictive models. A challenge in such scenarios is to specify the criteria that should be considered when selecting the datasets to use in order to solve a specific problem. The aim of this paper is to present a case study that combines LOSD in order to perform machine learning analysis and support advanced decision-making. Towards this end, we first specify the criteria that define which datasets can be used to solve a problem using machine learning. The datasets of our case study are selected based on these criteria. We also present the Compatible LOSD Selection tool, a proof of concept of the case study that facilitates the selection of datasets that will be combined for machine learning analysis.

The rest of the paper is organised as follows: Section 2 presents the method of this paper. Section 3 defines the compatibility criteria. Section 4 presents the case study and its results. Section 5 presents the Compatible LOSD Selection tool. Finally, Section 6 concludes and discusses the results.

2 Method

The method used in the case study includes four steps:

1. Problem definition. The problem definition step enables users to define the problem they are interested in solving using machine learning analysis. To this end, the response variable of the predictive model is defined (including geographical boundaries, time constraints, units of measure etc.). This requires exploring the metadata of the available datasets. Moreover, the type of the problem is specified (e.g. regression, classification etc.). For example, a problem could be to predict the 2012 house prices in the 2001 data zones of Scotland.

2. Data selection. The data selection step selects the datasets that will be combined with the response variable and contribute towards solving the problem defined in the previous step. The selection of the datasets uses five structural criteria based on the granularity of the geographical dimension, the temporal dimension, the unit of the measure, the type of the measure, and additional dimensions.

3. Feature extraction. This step extracts from the datasets selected in the previous step numerous features, also known as predictors. Features are extracted from the combination of different dimensions and measures in one or more datasets. Dimensions determine and explain a feature. For example, an unemployment dataset with four dimensions, i.e. age group (15-25, 25-54, 55-64), type of unemployment (cyclical, frictional, structural), measure type (count, ratio), and reference period (2001-Q1, ..., 2016-Q4), could result in 3 x 3 x 2 x 64 = 1,152 features.

4. Feature selection and model creation. The feature selection step selects among all extracted features the ones that will be used to construct the predictive model. Those are features that are significantly correlated with the response variable. Features considered redundant or irrelevant are ignored. Machine learning methods to select features include the Least Absolute Shrinkage and Selection Operator (Lasso) [17], stepwise selection, and tree boosting. For our case study we use the Lasso method to select features. In addition, in order to assess the result of the machine learning method used to select features, criteria such as the Mean Squared Error (MSE), which measures the average of the squares of the errors (i.e. the differences between the actual and the predicted values), and the misclassification error are commonly used (the corresponding formulas are given after this list). In our case study we use the Root Mean Squared Error (RMSE) to assess the result of Lasso.
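For reference, the error measures mentioned in step 4 can be written out explicitly. These are the standard textbook definitions rather than anything specific to this paper, where y_i is the actual value, \hat{y}_i the predicted value, and n the number of observations; the log-error variant is the one applied to the house prices in Section 4:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}.
\]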
LOSD contribute to the second step of the methodology by facilitating the selection of datasets that can be combined with the response variable in order to construct the predictive model. The next section specifies the criteria to consider in order to select compatible LOSD that can contribute to a predictive model as a response variable or as a feature.

3 Combining statistical datasets for machine learning analysis

In general, statistical data are aggregated data that describe a measured fact (e.g. house prices) at specific geographical points (e.g. a country, city, or building) and in a specific period of time (e.g. a year, month, or week). In this sense, statistical data can be compared to a data cube, where each cell contains a measure or a set of measures, and thus we can refer to statistical data as data cubes or just cubes [4]. The geographical point and the period of time that describe a measure are called dimensions (geographical and temporal respectively). A statistical dataset can be described by additional dimensions as well, such as age, gender etc. It is frequently useful to create a subset of a statistical dataset. Such a subset fixes all but one (or a small subset) of the initial dataset's dimensions and is called a slice through the dataset [3].

The second step of our methodology requires selecting the slices of statistical datasets that will contribute as the response variable (also called Y) and as the features (also called Xs) of the defined problem based on:

1. The granularity of the geographical dimension.
2. The temporal dimension.
3. The unit of the measure.
4. The type of the measure.
5. Additional dimensions.

We specify the above criteria separately for the response variable and the features (a query sketch illustrating such a selection over LOSD is given at the end of this section). In particular, the selection of the slice that will be used for the response variable is based on:

1. The granularity of the geographical dimension. Commonly the defined problem focuses on geographical points with a specific granularity level (e.g. to predict the house prices in the 2001 data zones of Scotland). As a result, the slice selected for the response variable should use this specific granularity level. This will be the open dimension of the slice.

2. The temporal dimension. The defined problem focuses on a specific period of time (e.g. to predict the 2012 house prices in the 2001 data zones of Scotland). As a result, the slice selected for the response variable should have the time dimension fixed to the selected period of time.

3. The unit of the measure. Datasets usually use a unit to describe their measure. Common units of measure are ratio and count. Depending on the problem, slices using ratio or count should be selected. If the selected dataset includes more than one unit of measure, the unit dimension should be fixed to the preferred unit of measure.

4. The type of the measure. The measure of a statistical dataset may be categorical or continuous. Continuous measures contain numbers with an infinite number of values between any two values. Categorical measures contain a finite number of categories or distinct groups. The nature of the defined problem specifies the type of the measure to be selected for the slice of the response variable.

5. Additional dimensions. Additional dimensions in the selected slice are desirable (but optional) as they increase the number of extracted features that could be used in the construction of more reliable predictive models. A common additional dimension is, for example, gender. Additional dimensions should also be fixed to a specific value.

In addition, the selection of the slices for the features is based on:

1. The granularity of the geographical dimension. The slices selected for the X variables of the predictive model should have the same granularity level as the slice of Y. As a result, only datasets that have the same granularity level in the geographical dimension as the Y variable should be selected.

2. The temporal dimension. Machine learning usually aims to predict a specific phenomenon based on historical data. As a result, slices selected for the X variables should refer to the same or past years relative to the Y variable.

3. The unit of the measure. Slices using ratio are preferably selected over count because ratio values are normalised. However, slices with count measures can also be selected provided that they are combined with other count measures in the next step of the methodology (namely feature extraction) in order to construct new ratio variables. For example, one could select a slice counting the number of births and a slice counting the number of deaths from the data portal of Scotland in order to create, in the feature extraction step, the ratio 'number of births / number of deaths'.

4. The type of the measure. In a predictive model it is not mandatory for the Y and X variables to have the same type. As a result, when the Y variable is categorical, the data selection step can select slices with either categorical or continuous measures for the features, and vice versa.

5. Additional dimensions. The selected slices can also have additional dimensions (e.g. gender) with the same or different values relative to the respective dimension of Y.
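To make the selection step above more concrete, the sketch below queries a LOSD portal from R for candidate datasets whose observations refer to a given geography. It is a minimal illustration and not part of the paper's tooling: the SPARQL endpoint URL, the use of rdfs:label, and the URI pattern for 2001 data zones are assumptions that would have to be adapted to the dimension and geography URIs actually used by a portal such as statistics.gov.scot.

    # Minimal sketch: list LOSD datasets that contain observations for a given
    # reference area, assuming RDF Data Cube modelling and a SPARQL endpoint.
    library(httr)   # sends the SPARQL request over HTTP
    library(readr)  # parses the CSV result

    endpoint <- "http://statistics.gov.scot/sparql"   # assumed endpoint URL

    query <- '
    PREFIX qb:    <http://purl.org/linked-data/cube#>
    PREFIX sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
    PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?dataset ?label WHERE {
      ?obs a qb:Observation ;
           qb:dataSet ?dataset ;
           sdmxd:refArea ?area .
      ?dataset rdfs:label ?label .
      # Hypothetical URI pattern for Scottish data zones; real portals may use
      # their own reference-area dimension and geography URIs.
      FILTER(STRSTARTS(STR(?area),
             "http://statistics.gov.scot/id/statistical-geography/S01"))
    }
    LIMIT 100'

    res <- GET(endpoint,
               query = list(query = query),
               add_headers(Accept = "text/csv"))
    candidates <- read_csv(content(res, as = "text", encoding = "UTF-8"))
    print(candidates)

In practice one would add similar graph patterns (or inspect the datasets' structure definitions) for the reference period, the unit, the measure type, and any additional dimensions, mirroring the five criteria listed above.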
4 Case Study: Predicting the House Prices in Scotland

The case study presented in this paper uses datasets from the official statistics data portal of Scotland, i.e. http://statistics.gov.scot, which was launched in August 2016. At the time of writing, the portal provides access to 220 statistical datasets about Scotland. The datasets can be viewed in various formats, including tables, maps, and charts, or downloaded in formats such as CSV or N-Triples. The datasets can be browsed by theme (e.g. Labour Force, Environment, Transport) or by the organisation that published the dataset (e.g. the Scottish Government, SEPA, or Transport Scotland).

Scottish official statistics are also provided in linked data format using the W3C's RDF Data Cube vocabulary (https://www.w3.org/TR/vocab-data-cube/), which allows modelling statistical data as data cubes. In particular, each dataset in the portal is modelled as a data cube. Each data cube provides multiple ancillary dimensions that complement the indicator, which is the measure of the data cube. The two most common dimensions used to describe the datasets are the geographical dimension, called Reference Area, and the temporal dimension, called Reference Period. The geographical dimension of the datasets is based on a hierarchy of administrative or census-based areas, ranging from Scottish data zones to electoral wards and countries. Granularity refers to the levels of depth of the reference area dimension of each dataset. Some examples of these levels include country, council areas, electoral wards, and data zones. For example, the house sales dataset (http://statistics.gov.scot/resource?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fhouse-sales) describes the number of residential property transactions recorded at different geographical levels of Scotland (e.g. countries, electoral wards, 2001 data zones, and others) in different reference periods (1993-2017). Other commonly used dimensions include the gender and age group of the population.

Problem definition. The objective of the case study presented in this paper is to predict the 2012 mean house prices in the 2001 data zones of Scotland. This is a description of the response variable of our problem.

2001 data zones were introduced in 2004 and are the smallest geographical granularity level in Scotland. They have populations between 500 and 1,000 household residents. Selecting 2001 data zones for our response variable results in a great number of observations, which helps to avoid the curse of high dimensionality (i.e. having fewer observations than features), to create a more robust model, and to predict prices for a specific district or neighbourhood.

Regarding the machine learning method, the Lasso regression method is selected to solve the problem described above. Lasso yields sparse models, i.e. models that involve only a subset of the variables.
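Since the Lasso is central to the case study, it is worth recalling its standard (Lagrangian) form [17]; this is the textbook objective, not a detail specific to this paper. With response y_i, features x_{ij}, coefficients \beta_j, and tuning parameter \lambda:

\[
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\;\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert .
\]

The L1 penalty shrinks some coefficients exactly to zero, which is why the fitted model involves only a subset of the available features; larger values of \lambda lead to sparser models.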
Data selection. After the definition of our problem we first search for datasets in the Scottish data portal that can contribute as the response variable of our problem. The slice selected for the response variable comes from the House prices dataset of the Scottish portal (http://statistics.gov.scot/resource?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fhouse-sales-prices), with the temporal dimension fixed to 2012, the measure type fixed to mean, and the values of the reference area dimension coming from the 2001 data zones in Scotland.

We then search for datasets that can contribute as features. The selected datasets should be compatible with the response variable. For this reason we search in the Scottish data portal for datasets based on the compatibility criteria described in the previous section. In particular, we search for datasets where:

1. The granularity level of their geographical dimension is Scottish 2001 data zones.
2. Their temporal dimension refers to a year in the range 2009-2012.
3. Their unit of measure is ratio.
4. They have a continuous or categorical measure.
5. (Optionally) they have additional dimensions.

It should be noted that in this case study we only searched for datasets with a ratio unit of measure. However, as described in Section 3, count datasets could also be selected, provided that they are transformed to ratio values in the next step.

In addition, some datasets may be correlated with the response variable of our case study (the house sales prices) simply because they are derived from it, and should be excluded. For example, the Council Tax Bands dataset provides the rate of houses that belong to a specific council tax band in each Scottish data zone. This measure is, however, actually derived from the price of houses and for this reason we should not include it in our case study. In effect, Council Tax Bands is a discrete measure that aggregates the number of houses according to their value.

The exploration of the Scottish data portal for datasets that satisfy the above criteria results in the selection of 21 compatible datasets.
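The compatibility checks above can also be expressed programmatically. The following sketch filters a toy metadata table of candidate slices against the criteria used in this case study; the table, its column names, and its values are illustrative only and do not correspond to the actual metadata fields of the Scottish portal.

    # Minimal sketch: keep only slices that satisfy the compatibility criteria
    # of Section 3 for the 2012 house prices response variable.
    library(dplyr)

    slices <- data.frame(
      dataset   = c("House Sales Prices", "Breastfeeding",
                    "Council Tax Bands", "Land Area"),
      geography = c("2001 Data Zones", "2001 Data Zones",
                    "2001 Data Zones", "Council Areas"),
      year      = c(2012, 2011, 2012, 2010),
      unit      = c("Mean", "Ratio", "Ratio", "Ratio"),
      derived_from_response = c(TRUE, FALSE, TRUE, FALSE)
    )

    compatible <- slices %>%
      filter(geography == "2001 Data Zones",   # same granularity as the response
             year >= 2009, year <= 2012,       # same or earlier reference periods
             unit == "Ratio",                  # normalised (ratio) measures only
             !derived_from_response)           # e.g. exclude Council Tax Bands

    print(compatible)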
Feature extraction. In this step we extract multiple features from each selected compatible dataset. In our case study each feature is extracted from only one dataset. For instance, the dimensions of the "Age of First Time Mothers" dataset, which describes the rate of first-time mothers, include reference period and age. For the age dimension three values are used: (1) 19 and under, (2) 35 and over, and (3) All. If we also consider that we have selected 4 reference periods for our datasets (i.e. 2009, 2010, 2011 and 2012), the final number of features that can be extracted from this dataset is calculated as: 2 (the two values of the age dimension; "All" is not included) x 1 (the number of different unit types) x 4 (the number of reference periods) = 8 features. The same applies to the rest of the selected datasets in order to extract all features. The feature extraction step results in 450 features.
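A sketch of how such features can be derived from a slice in long form is shown below. The toy data frame and its column names are illustrative; they do not reproduce the actual structure of the Age of First Time Mothers dataset.

    # Minimal sketch: turn a long-format slice into one feature column per
    # combination of dimension value and reference period.
    library(tidyr)

    slice_long <- data.frame(
      data_zone = rep(c("S01000001", "S01000002"), each = 4),
      age_group = rep(c("19 and under", "35 and over"), times = 4),
      year      = rep(c(2009, 2009, 2012, 2012), times = 2),
      value     = runif(8)   # synthetic rates, for illustration only
    )

    zone_features <- pivot_wider(slice_long,
                                 id_cols     = data_zone,
                                 names_from  = c(age_group, year),
                                 values_from = value)
    # One row per data zone; one column per (age group, year) combination,
    # e.g. "19 and under_2009", "35 and over_2012", ...
    print(zone_features)

Repeating this over the four reference periods and all dimension values of the 21 compatible datasets yields the 450 features mentioned above.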
Feature selection and model creation. In order to eliminate insignificant features we use the regression analysis method called Lasso. The Lasso implementation was made using the glmnet library (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html). Lasso keeps only the important features (i.e. features which add value to our estimations) and removes the rest of them. A reduced number of features facilitates the interpretation of the results. In our case study Lasso results in 34 features coming from 10 datasets. The initial number of features (i.e. 450) is hence significantly reduced (by more than 92%). The 10 datasets selected are:

1. Age of First Time Mothers
2. Ante-Natal Smoking
3. Breastfeeding
4. Disability Living Allowance
5. Dwellings by Number of Rooms
6. Employment and Support Allowance
7. Hospital Admissions
8. Household Estimates
9. Income And Poverty Modelled Estimates
10. Job Seeker's Allowance Claimants

Table 1 presents the detailed results of the application of the Lasso analysis method. We can see that there is no significant change between the Lasso lowest RMSE and the Lasso RMSE at one standard error.

Table 1: The results of the Lasso analysis method

  Number of observations used              5,380
  Number of predictors                     450
  Type of predictors                       Ratio
  Years of predictors                      2009-2012
  Type of response                         Mean
  Year of response                         2012
  Lasso lowest RMSE                        0.2664764 (47 selected variables)
  Lasso RMSE at one standard error (1SE)   0.2737672 (34 selected variables)
  Percentage of reduction                  92%

Lasso uses the cross-validation method to separate the data into training and test data and make the prediction. Cross-validation divides the initial dataset into a number of roughly equal parts (also known as folds). In each round of the cross-validation, each fold in turn is used as test data and the rest of the folds as training data. We use the Root Mean Squared Error (RMSE) of the log error to assess the result of Lasso. The log error is the log of the predicted value minus the log of the actual value. In our case study the folds used as training and test data are selected at random (controlled by a seed). This means that repeating the same procedure with the same datasets but a different random split results in different training and test data and, hence, a different RMSE. We repeat the same Lasso analysis using the same datasets 100 times (which is also the default value used by the glmnet library) in order to see the variance of the RMSEs across the repetitions.

Fig. 1 presents two boxplots. The left boxplot illustrates the variance of the RMSE calculated in all 100 repeated Lasso experiments. We can see that the median of the RMSEs is close to 0.273 and that the distance between the median and the lower and upper quartiles is limited. The right boxplot presents the variation of the total number of selected features based on one standard error. We can see that the median number of features is 28, which is also the lower quartile.

Fig. 1: Variance of RMSEs and total number of selected features

Cross-validation allows selecting the best value for the tuning parameter (lambda) or, equivalently, the value of the constraint. To this end, we compute the lambda parameter. The lambda parameter controls the amount of regularisation, so choosing a good value for it is crucial. In cases with a very large number of features, the Lasso allows us to efficiently find the model that involves a small subset of the features. The value selected for lambda is either the one that corresponds to the smallest error or the value within one standard error of it. The plot in Fig. 2 shows how the RMSE fluctuates for different values of lambda (and hence different numbers of features). Higher values of lambda produce less flexible functions and, hence, higher errors, while lower values of lambda produce more flexible functions and, hence, lower errors. We select the value of lambda that corresponds to the minimum RMSE, i.e. log(lambda) = -4. Following this rule, the final number of selected features is 34.

Fig. 2: Lambda - RMSE
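A condensed sketch of this feature selection step with the glmnet library is given below. The data are synthetic stand-ins (the real analysis uses the 450 ratio features and the 2012 mean prices of the 5,380 data zones), but the call to cv.glmnet, the alpha = 1 setting for the Lasso penalty, and the lambda.min / lambda.1se rules correspond to the procedure described above.

    # Minimal sketch: cross-validated Lasso on the log of mean house prices.
    library(glmnet)

    set.seed(1)  # controls the random assignment of observations to folds

    # Synthetic stand-ins for the real feature matrix and response.
    feature_matrix <- matrix(rnorm(5380 * 20), nrow = 5380,
                             dimnames = list(NULL, paste0("feature_", 1:20)))
    mean_price <- exp(rnorm(5380, mean = 11))

    cv_fit <- cv.glmnet(x = feature_matrix,
                        y = log(mean_price),  # response: log of the mean price
                        alpha = 1,            # alpha = 1 selects the Lasso penalty
                        nfolds = 10)

    rmse_min  <- sqrt(min(cv_fit$cvm))            # RMSE of the log error at lambda.min
    coefs_1se <- coef(cv_fit, s = "lambda.1se")   # coefficients under the 1-SE rule
    selected  <- setdiff(rownames(coefs_1se)[coefs_1se[, 1] != 0], "(Intercept)")

    plot(cv_fit)  # cross-validation error against log(lambda), analogous to Fig. 2

Repeating this procedure with different seeds gives the distribution of RMSEs and of the number of selected features summarised by the boxplots in Fig. 1.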
5 The Compatible LOSD Selection tool

We develop an open source tool as a proof of concept of the case study. The tool offers an interface that facilitates the selection of compatible statistical datasets that can be used for machine learning analysis. The tool is based on R Shiny (https://shiny.rstudio.com/) and obtains statistical datasets from the Scottish data portal. It allows selecting a dataset from the Scottish portal, and searches for and presents compatible datasets based on the defined compatibility criteria. The tool is available on GitHub (https://github.com/akaramanou/compatible-LOSD-selection-tool).

Fig. 3 presents a screenshot of the Compatible LOSD Selection tool. On the left panel, the 2012 house prices dataset has been selected as the first dataset. On the right panel, the 20 compatible datasets are presented. The selected compatible datasets can be extracted to contribute to the creation of a predictive model.

Fig. 3: The Compatible LOSD Selection tool
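A minimal outline of how such a Shiny interface can be structured is sketched below. This is not the implementation available on GitHub; the catalogue and the find_compatible() helper are placeholders standing in for the portal metadata and the compatibility criteria of Section 3.

    # Minimal sketch of a Shiny interface: pick a dataset, list compatible ones.
    library(shiny)

    # Placeholder catalogue; the real tool obtains metadata from the Scottish portal.
    dataset_catalogue <- data.frame(
      dataset   = c("House Sales Prices", "Breastfeeding", "Hospital Admissions"),
      geography = "2001 Data Zones",
      unit      = c("Mean", "Ratio", "Ratio")
    )

    # Trivial stand-in for the compatibility check of Section 3.
    find_compatible <- function(catalogue, selected) {
      catalogue[catalogue$dataset != selected, ]
    }

    ui <- fluidPage(
      titlePanel("Compatible LOSD Selection"),
      sidebarLayout(
        sidebarPanel(
          selectInput("dataset", "Select a dataset:",
                      choices = dataset_catalogue$dataset)
        ),
        mainPanel(tableOutput("compatible"))
      )
    )

    server <- function(input, output) {
      output$compatible <- renderTable({
        find_compatible(dataset_catalogue, input$dataset)
      })
    }

    shinyApp(ui, server)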
6 Conclusions

Although governments and other organisations are continuously opening up their statistical data, the potential of open data has to a large extent remained unrealised due to institutional and technical barriers. In machine learning analyses, linked data facilitate the discovery, retrieval, and combination of data on the Web. However, a challenge in such analyses is to specify the criteria to be considered in order to select the proper datasets to construct the predictive model.

In this paper we presented a case study that applied machine learning methods to compatible statistical datasets from the Scottish data portal in order to support advanced decision-making scenarios. The case study aimed to predict the house prices in Scotland. To facilitate the discovery of compatible datasets we defined five compatibility criteria. Based on the criteria we discovered 21 datasets compatible with the response variable. From these datasets we extracted 450 features and applied the Lasso method in order to select the most important ones. We ended up with 34 features coming from only 10 datasets (over 92% fewer features than the ones initially identified). This means that there is a strong relationship between the house prices in Scotland and these 10 datasets. We also developed the Compatible LOSD Selection tool, which facilitates discovering compatible LOSD datasets in order to perform machine learning analysis.

This case study is indicative of the importance of using machine learning to analyse statistical datasets and support decision making. Starting from a problem that needed to be solved, we ended up identifying relationships between datasets, some of them previously unknown. For example, our case study revealed a strong relationship between the breastfeeding percentage and the mean house prices in Scotland. Other relationships were more obvious, such as the one between the Income and Poverty estimates and the mean house prices. Eliminating all irrelevant datasets in an easy way can be really beneficial for decision makers, as it saves them the time of dealing with unessential data and helps them understand which variables matter most and which can be ignored. More importantly, this case study shows that decision makers can already exploit historical statistical data using machine learning in order to take evidence-based decisions.

The case study also showed that, when it comes to statistical data, effectively discovering compatible datasets is crucial in order to create successful predictive models. The compatibility criteria we defined are only a first attempt to define compatibility between statistical datasets. However, this first attempt showed that discovering compatible datasets forms the basis for extracting meaningful results using machine learning.

Acknowledgments. This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Program "Human Resources Development, Education and Lifelong Learning 2014-2020" in the context of the project "Integrating open statistical data using semantic technologies" (MIS 5007306).

References

1. Bogomolov, A., Lepri, B., Staiano, J., Oliver, N., Pianesi, F., Pentland, A.: Once upon a crime: Towards crime prediction from demographics and mobile data. In: Proceedings of the 16th International Conference on Multimodal Interaction, pp. 427-434. ACM (2014)
2. Coleman, S.Y.: Data-mining opportunities for small and medium enterprises with official statistics in the UK. Journal of Official Statistics 32(4), 849-865 (2016)
3. Cyganiak, R., Reynolds, D., Tennison, J.: The RDF Data Cube Vocabulary. W3C Recommendation, W3C (2014)
4. Datta, A., Thomas, H.: The cube data model: A conceptual model and algebra for on-line analytical processing in data warehouses. Decision Support Systems 27(3), 289-301 (1999)
5. Değirmenci, T., Özbakır, L.: Differentiating households to analyze consumption patterns: A data mining study on official household budget data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(1) (2018)
6. European Commission: Guidelines on recommended standard licences, datasets and charging for the reuse of documents. C240/1 (2014)
7. Hassani, H., Saporta, G., Silva, E.S.: Data mining and official statistics: The past, the present and the future. Big Data 2(1), 34-43 (2014)
8. Janssen, M., Charalabidis, Y., Zuiderwijk, A.: Benefits, adoption barriers and myths of open data and open government. Information Systems Management 29(4), 258-268 (2012)
9. Kalampokis, E., Nikolov, A., Haase, P., Cyganiak, R., Stasiewicz, A., Karamanou, A., Zotou, M., Zeginis, D., Tambouris, E., Tarabanis, K.A.: Exploiting linked data cubes with OpenCube Toolkit. In: International Semantic Web Conference (Posters & Demos), vol. 1272, pp. 137-140 (2014)
10. Kalampokis, E., Roberts, B., Karamanou, A., Tambouris, E., Tarabanis, K.A.: Challenges on developing tools for exploiting linked open data cubes. In: SemStats@ISWC (2015)
11. Kalampokis, E., Tambouris, E., Karamanou, A., Tarabanis, K.: Open statistics: The rise of a new era for open data? In: International Conference on Electronic Government and the Information Systems Perspective, pp. 31-43. Springer (2016)
12. Kalampokis, E., Tambouris, E., Tarabanis, K.: Linked open government data analytics. In: International Conference on Electronic Government, pp. 99-110. Springer (2013)
13. Kalampokis, E., Tambouris, E., Tarabanis, K.: Linked open cube analytics systems: Potential and challenges. IEEE Intelligent Systems 31(5), 89-92 (2016)
14. Manyika, J., Chui, M., Groves, P., Farrell, D., Van Kuiken, S., Doshi, E.A.: Open data: Unlocking innovation and performance with liquid information. McKinsey Global Institute 21 (2013)
15. Prati, G., Pietrantoni, L., Fraboni, F.: Using data mining techniques to predict the severity of bicycle crashes. Accident Analysis & Prevention 101, 44-54 (2017)
16. Tambouris, E., Kalampokis, E., Tarabanis, K.: Processing linked open data cubes. In: International Conference on Electronic Government, pp. 130-143. Springer (2015)
17. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 267-288 (1996)
18. W3C: Data on the Web Best Practices. W3C Recommendation (2017). https://www.w3.org/TR/dwbp/
19. Zeginis, D., Kalampokis, E., Roberts, B., Moynihan, R., Tambouris, E., Tarabanis, K.: Facilitating the exploitation of linked open statistical data: JSON-QB API requirements and design criteria. In: 5th International Workshop on Semantic Statistics (SemStats 2017), co-located with the 16th International Semantic Web Conference (ISWC 2017), vol. 1923 (2017)
20. Zhu, Y.Q., Kindarto, A.: A garbage can model of government IT project failures in developing countries: The effects of leadership, decision structure and team competence. Government Information Quarterly 33(4), 629-637 (2016)