<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Vocabulary for Incorporating Predictive Models into the Linked Data Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evangelos Kalampokis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Areti Karamanou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Efthimios Tambouris</string-name>
          <email>tambouris@uom.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Tarabanis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Informatics and Telematics Institute, Centre for Research &amp; Technology - Hellas 6th km Xarilaou - Thermi</institution>
          ,
          <addr-line>57001, Thessaloniki</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Systems Lab, University of Macedonia</institution>
          ,
          <addr-line>Egnatia 156, 54006 Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predictive modeling is the process of using data and statistical or data mining methods to predict new observations. The predictive models created by this process could be reused in different applications, in the same way that open data is reused. To this end, a few standards have been proposed to enable the transfer of predictive models across platforms and applications. In this paper we argue for incorporating predictive models into the Linked Data Web. We propose an RDF Schema vocabulary that enables the creation of predictive model descriptions adhering to the Linked Data principles. The incorporation of these descriptions into the Linked Data Web could create new potential beyond cross-platform model reuse. In particular, it will enable (a) easy discovery and reuse of appropriate models at Web scale and (b) creation of more accurate models by exploiting connections of models to other models, datasets and other resources on the Web.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked data</kwd>
        <kwd>statistical data</kwd>
        <kwd>predictive analytics</kwd>
        <kwd>vocabulary</kwd>
        <kwd>RDF</kwd>
        <kwd>predictive model</kwd>
        <kwd>interoperability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the context of quantitative empirical modeling, the term predictive analytics
refers to the building and assessment of a model aimed at making empirical
predictions using data and statistical or data mining methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In general, the
goal of predictive models is to predict the output value (Y) for new observations
given their input values (X). The inputs are often called the predictors, or more
classically the independent variables, while the outputs are called the responses,
or classically the dependent variables. Examples of predictive models include
the prediction of stock market volatility from Yahoo! Finance message boards
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], movie success from weblog content [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], influenza-like illnesses from Google
search queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], product sales from Amazon reviews [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and levels of rainfall
from Twitter posts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        These predictive models can be reused by different applications, in the same
way that open data is reused [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. To this end, standardization efforts
such as the Predictive Model Markup Language (PMML)<sup>3</sup> have been proposed.
This XML-based language enables developed models to be exported and imported
as components in other processes and systems using XML files. However,
discovering an appropriate model for a task at hand is at the moment a
time-consuming activity that requires considerable manual effort: searching in
scientific articles, contacting researchers or professionals and exchanging files.
      </p>
      <p>In addition, predictive models incorporate knowledge about a domain or a
problem area. For example, a model could include variables that affect economic
development. However, more than one model can usually be created for a
specific problem using different data and statistical or data mining methods.
These models provide fragmented views of that problem. Moreover, these
views could be either complementary or contradictory. As a result, the capability
of connecting these different views could enhance the understanding of a problem
and could facilitate the building of more accurate models.</p>
      <p>
        At the same time, the adoption of the Linked Data principles and technologies
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has promised to enhance the analysis of statistical data at a Web scale.
For example, Linked Data could facilitate performing data analytics on top of
combined statistical datasets that were previously closed in disparate sources
and can now be linked in order to provide unexpected and unexplored insights
into different domains and problem areas [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Moreover, linking statistical data
to the Linked Data Web could enable the enrichment of a particular dataset
and thus the extraction of interesting and previously hidden insights related to
particular events [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In this paper we suggest that the incorporation of predictive models into
the Linked Data Web could create new potential beyond the reuse of
models across different platforms. In particular, it could enable the discovery of
predictive models at Web scale in an easy and effective manner. For
example, it would make possible queries such as "On which data mining method is the
most accurate model that predicts influenza-like illnesses from Google queries
based?" or "What predictor variables should a model aiming at predicting
unemployment include?". Moreover, in this paper we propose an RDF Schema
vocabulary, named the Linked Statistical Models (limo) vocabulary, that will
enable the incorporation of descriptions of predictive models into the Linked
Data Web and establish links to other resources such as datasets, other models,
academic articles and studies.</p>
      <p>The remainder of the paper is organized as follows. In section 2 we describe
the motivation behind the incorporation of predictive models descriptions into
the Linked Data Web. In section 3 we present related work regarding (a) existing
endeavors for describing predictive models and (b) widely used RDF vocabularies
in the area of statistics. Section 4 presents the Linked Statistical Models (limo)
vocabulary. Finally, in section 5 a number of use cases are presented while in
section 6 conclusions are drawn along with future work.</p>
    </sec>
    <sec id="sec-2">
      <title>3 http://www.dmg.org</title>
      <sec id="sec-2-1">
        <title>Motivation</title>
        <p>
          Different models could present conflicting results in the same problem area
and for the same variables, depending on the statistical methods and/or the
data that have been employed. For example, Chiricos [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] reviewed 68 studies
about the relationship between crime and the unemployment rate and found
that fewer than half of these studies reported positive significant effects of
unemployment on crime rates. In addition, Kalampokis et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] reviewed
52 empirical predictive models that employ predictors related to Social Media.
They found that the predictive power of a model is directly related to the
predictors, the statistical method, the datasets and the evaluation method that
have been selected. Thus, in order to better understand a problem we need to be
able to discover and analyze various models that share common characteristics.
        </p>
        <p>In addition, statistical models that have been developed based on a specific
dataset can be reused in another case. For example, a model developed
for predicting sales based on data from Company X could be efficiently reused
with data from Company Z. Moreover, a model predicting sales using a specific
data mining method can be reused as a baseline for another model that uses a
different method.</p>
        <p>Publishing descriptions of statistical models on the Web following the Linked
Data principles could have the following benefits:</p>
        <list list-type="order">
          <list-item>
            <p>Discovery of variables between which a predictive relationship has been
suggested by an empirical model. For example, it will be possible to discover
that X models show a predictive relationship between product
sales and advertising budget while Z models show a negative or
no relationship between them.</p>
          </list-item>
          <list-item>
            <p>Discovery of all predictor variables that are connected to product sales
through successful empirical predictive models.</p>
          </list-item>
          <list-item>
            <p>Discovery of statistical or data mining methods that have been used to
identify relationships between variables. For example, it could reveal that most of the models that
are able to accurately predict product sales from advertising budget have
used linear regression methods.</p>
          </list-item>
          <list-item>
            <p>Discovery of datasets that have been used to identify predictive relationships
between variables. For example, it could reveal that models showing a strong predictive
relationship between product sales and advertising budget have employed data
from the U.S. in the period between 1975 and 2004.</p>
          </list-item>
          <list-item>
            <p>Discovery of a specific predictive model that shows a relationship between
variables based on aspects such as its creator, the affiliation of the creator,
the journal in which the results have been published, etc.</p>
          </list-item>
          <list-item>
            <p>Discovery of new datasets in order to reuse existing models. For example,
identification of datasets from Europe covering the last ten years in order to reuse
a predictive model produced with data from the U.S.</p>
          </list-item>
          <list-item>
            <p>Discovery of predictive models that could be used as baseline models in
building new, more accurate predictive models.</p>
          </list-item>
        </list>
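        <p>As an illustration of benefit 3, the following SPARQL sketch lists the methods used by models that predict product sales from advertising budget, together with the number of models per method. It relies on limo properties described in the vocabulary section of this paper (limo:variable, limo:usageType, limo:theme, limo:method). The namespace URI bound to limo:, the eg: namespace and the concepts eg:productSales and eg:advertisingBudget are hypothetical placeholders, not published identifiers, and the query does not filter on predictive power.</p>
        <preformat>PREFIX limo: &lt;http://purl.org/limo-ontology/limo#&gt;   # assumed form of the namespace
PREFIX eg:   &lt;http://example.org/&gt;                    # hypothetical example namespace

SELECT ?method (COUNT(DISTINCT ?model) AS ?nmodels)
WHERE {
  ?model limo:method   ?method ;
         limo:variable ?response ;
         limo:variable ?predictor .
  ?response  limo:usageType eg:response ;
             limo:theme     eg:productSales .        # hypothetical concept
  ?predictor limo:usageType eg:predictor ;
             limo:theme     eg:advertisingBudget .   # hypothetical concept
}
GROUP BY ?method
ORDER BY DESC(?nmodels)</preformat>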
        <p>These benefits will be achieved only if a vocabulary for describing predictive
models in RDF is specified and Linked Data descriptions of predictive
models are published widely. The scope of this paper is limited to
the former.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Related Work</title>
        <p>
          The Predictive Model Markup Language (PMML) is an XML standard that
represents and describes data mining and statistical models, as well as some of
the operations required for cleaning and transforming data prior to modeling
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. PMML aims to provide enough infrastructure for one application to
produce a model and another application to consume it by reading the PMML
XML data file [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. PMML has the following general structure:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>The Header contains information about the application that generated the model,
including a time stamp.</p>
          </list-item>
          <list-item>
            <p>The Mining Build Task contains vendor-specific information about how the
model was built.</p>
          </list-item>
          <list-item>
            <p>The Data Dictionary contains details about the variables, called Data Fields,
that participate in the model. These can be thought of as representing the actual
data used to develop the model, including information such as the name, the
type of data (e.g. string, numeric) and how it is used (e.g. whether it is a continuous
numeric value, a categorical value, etc.).</p>
          </list-item>
          <list-item>
            <p>The Transformation Dictionary describes how to manipulate the data fields
from the data dictionary into variables that exist within the PMML definition.
This includes normalization, discretization, value mapping, etc.</p>
          </list-item>
          <list-item>
            <p>The Model contains model-specific features according to the model
type (e.g. association rules, clustering, general regression, support vector
machines, and neural networks). For instance, the NeuralNetwork element
includes the activationFunction attribute that specifies the activation
function to be used by the network neurons when processing incoming data.
Furthermore, it contains elements that are common to all model types, such
as Outputs, which define the different types of results (e.g. predictedValue,
standardError, probability, residual) that can be generated by a model, and
Mining Schema, which defines what to do in case any of the data fields
defined in the DataDictionary element are missing or contain invalid or outlier
values.</p>
          </list-item>
        </list>
        <p>This structure presents only the top-level elements. PMML is a very rich language
that specifies a large number of elements and attributes related
to data setup, data pre-processing and model representation. All these elements
aim at enabling model reuse across heterogeneous platforms and environments
for all the major statistical and data mining techniques. We should, however,
note that the first version of limo that is presented in this paper does not intend
to cover all the details required for importing a predictive model into an actual
platform and executing it.</p>
        <p>
          In addition, a number of widely used RDF Schema vocabularies are closely
related to statistics. The DDI-RDF vocabulary [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] focuses on raw record-level
datasets and describes their structure, while the RDF Data Cube vocabulary
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] aims at multidimensional aggregated data and provides for the description
of both the structure and the actual data of a dataset.
        </p>
        <p>The DDI-RDF vocabulary contains the disco:aggregation property, which indicates that
a qb:DataSet was derived by aggregating a record-level dataset. Moreover, the
DDI-RDF vocabulary uses the class disco:Variable, which provides a definition
of a column in a rectangular data file and thus enables understanding
of the content of a dataset. In addition, it defines the disco:LogicalDataSet
class, a subclass of dcat:Dataset, to provide a description of the content of a
data set. disco:LogicalDataSet is associated with disco:DataFile, a subclass
of dcat:Distribution as well as of dctype:Dataset, which represents
the physical form of the data set. A disco:LogicalDataSet is organized
into a set of instances of disco:Variable.</p>
        <p>In RDF Data Cube, qb:DataSet represents the entire
data set, that is, a data set that conforms to the defined structure of the RDF
Data Cube. Data sets may be organized into several slices. The
structure of a qb:DataSet, or of a slice of the actual data, is defined by the
class qb:DataStructureDefinition. A qb:DataStructureDefinition uses
the qb:component property to specify the component(s) of the dataset's
structure. qb:ComponentProperty is the superclass of the
properties that represent dimensions, measures and attributes, namely
qb:DimensionProperty, qb:MeasureProperty and qb:AttributeProperty respectively.</p>
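        <p>To make this structure concrete, the following SPARQL sketch lists the dimensions and measures declared for each qb:DataSet through its data structure definition. It uses only terms defined by the RDF Data Cube vocabulary (qb:structure, qb:component, qb:dimension, qb:measure); which datasets it returns depends entirely on the data it is run against.</p>
        <preformat>PREFIX qb: &lt;http://purl.org/linked-data/cube#&gt;

SELECT ?dataset ?dimension ?measure
WHERE {
  ?dataset a qb:DataSet ;
           qb:structure ?dsd .
  ?dsd qb:component ?component .
  # Each component specification points to either a dimension or a measure.
  { ?component qb:dimension ?dimension }
  UNION
  { ?component qb:measure ?measure }
}</preformat>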
        <p>
          Finally, based on these modeling endeavors a number of open statistical
datasets published by important international organizations such as the OECD,
the World Bank and the IMF have been transformed to Linked Data by third
parties [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]. Towards this end, a number of tools have been developed. For
example, Capadisli et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] created a tool for transforming statistical data
from SDMX-ML format to Linked Data, while Salas et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] created one for CSV and
OLAP databases.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>The limo Vocabulary</title>
        <p>In this section we present the Linked Statistical Models (limo) vocabulary,
which allows for the description of statistical and data mining models in the RDF
model and thus enables the incorporation of these models into the Linked Data
Web and their linking to other resources such as datasets, organizations, people and
articles.</p>
        <p>
          In general, predictive analytics comprise predictive models designed for
predicting new (or future) observations or scenarios as well as methods for
evaluating the predictive power of a model [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. The outcome value for a new
observation could be continuous (or quantitative) or categorical (or qualitative).
In the former case the problem is often referred to as a regression problem, while
in the latter as a classification problem. Predictive power refers to an empirical
model's ability to predict new observations accurately. In contrast, explanatory
power refers to the strength of association indicated by a statistical model. The
predictive power of a model should be tested on out-of-sample data (e.g.
cross-validation or a holdout sample) and with adequate predictive measures
(e.g. RMSE, MAPE, PRESS etc.). A popular method to obtain out-of-sample
data is to initially partition the data randomly, using one part (the training set)
to fit the empirical model, and the other (the holdout set) to assess the model's
predictive accuracy.
        </p>
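        <p>For reference, two of the measures mentioned above have the following standard definitions. The notation is introduced here for illustration only: y_i is the observed value, \hat{y}_i the predicted value, and n the number of out-of-sample observations.</p>
        <disp-formula><tex-math><![CDATA[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
]]></tex-math></disp-formula>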
        <p>The vocabulary's main classes are depicted in Fig. 1. Classes and properties
from existing, widely used vocabularies were reused whenever possible.</p>
        <list list-type="bullet">
          <list-item>
            <p>limo:Model is the actual predictive model that is described by the
vocabulary. The model has the following attributes:</p>
            <list list-type="bullet">
              <list-item>
                <p>dct:title, which is a name given to the model.</p>
              </list-item>
              <list-item>
                <p>dct:description, for a descriptive comment about the model and its goals.</p>
              </list-item>
              <list-item>
                <p>dct:issued, which defines the date on which the model was created.</p>
              </list-item>
              <list-item>
                <p>limo:modelType, which describes the main categories of models that can
be developed, namely classification, regression, clustering and
dimensionReduction.</p>
              </list-item>
              <list-item>
                <p>limo:spatial, which describes the spatial dimension of the
model. The spatial dimension of the model is derived from the actual data
that have been employed. For example, a model could have limo:spatial
U.S. if the data used for the development of the model come from
the U.S.</p>
              </list-item>
              <list-item>
                <p>limo:temporal, which describes the time period that the model
covers. The time period of the model reflects the period that is described in
the actual data that have been used for the development of the model.</p>
              </list-item>
            </list>
            <p>limo:Model is connected through the limo:data property to a multi-dimensional
data set, i.e. a qb:DataSet. This dataset contains the actual data that have
been used for the development of the model. As a result, the temporal and
spatial dimensions of the model could also be extracted from this dataset. In
predictive analytics we have three different types of data, namely evaluation,
validation and training data. So, limo includes three different sub-properties
of the limo:data property, one for each of these three types of data.</p>
            <p>limo:Model is also connected through the limo:rawData property to a
dctype:Dataset. This dataset includes the raw data that have been used in the
process of building the model. For example, this dataset could be a dump
of raw tweets or a dcat:Dataset which was subsequently analyzed in order to
produce the actual data employed by the model.</p>
            <p>Moreover, a limo:Model can be connected to a different limo:Model through
the limo:baseline property, which explicitly denotes that the predictive
power of a model has been evaluated against the power of another model.</p>
            <p>A limo:Model can also be published in a scientific article or report. Hence
we have included the limo:publishedIn property to express this
relationship.</p>
            <p>Finally, limo:Model is connected to a foaf:Agent through the dct:creator
property. This property denotes the person or organization that actually
built the model.</p>
          </list-item>
          <list-item>
            <p>limo:Variable represents the variables that are included in the predictive
model. The Variable class includes the following attributes:</p>
            <list list-type="bullet">
              <list-item>
                <p>dct:title denotes the actual name of the variable.</p>
              </list-item>
              <list-item>
                <p>dct:description enables the inclusion of a short text
describing what the variable is about.</p>
              </list-item>
              <list-item>
                <p>limo:variableType denotes whether the variable is
continuous, categorical or ordinal.</p>
              </list-item>
              <list-item>
                <p>limo:usageType denotes whether the variable is the response of the
model or one of the predictors.</p>
              </list-item>
            </list>
            <p>In addition, limo:Variable is categorized using the limo:theme property,
which connects the Variable to a skos:Concept.</p>
          </list-item>
          <list-item>
            <p>limo:Method describes the statistical or data mining method used for
creating the model. We assume that this class uses a set of predefined concepts
such as linear regression, logistic regression, Markov models, support vector
machines, random forests, neural networks etc. As a result, we assume that
limo:Method is a subclass of skos:Concept.</p>
          </list-item>
          <list-item>
            <p>limo:Power describes the predictive power of the model. The predictive
power has the following attributes:</p>
            <list list-type="bullet">
              <list-item>
                <p>limo:evaluationMethod is the method used to infer the predictive power of a model.
The evaluation methods include out-of-sample evaluation with statistics such
as Predicted Residual Sums of Squares or Root Mean Square Error, or
cross-validation techniques.</p>
              </list-item>
              <list-item>
                <p>limo:outcome is the actual value that the evaluation method produces.</p>
              </list-item>
            </list>
          </list-item>
          <list-item>
            <p>limo:File describes a file that can be imported into a particular platform
such as R or SAS in order to execute the model. This could also be a PMML XML
file.</p>
          </list-item>
        </list>
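        <p>The first question posed in the introduction, namely which method underlies the most accurate model predicting influenza-like illnesses from search queries, could be sketched in SPARQL as follows. The sketch assumes that limo:outcome values are comparable only within one evaluation method, so it restricts the results to eg:crossvalidation, and that a higher outcome is better, which holds for a correlation but not for error measures such as RMSE. The prefix URIs are assumptions; the eg: concepts are those used in the example description given later in this paper.</p>
        <preformat>PREFIX limo: &lt;http://purl.org/limo-ontology/limo#&gt;   # assumed form of the namespace
PREFIX eg:   &lt;http://example.org/&gt;                    # hypothetical example namespace

SELECT ?model ?method ?outcome
WHERE {
  ?model limo:method   ?method ;
         limo:power    ?power ;
         limo:variable ?response ;
         limo:variable ?predictor .
  ?power limo:evaluationMethod eg:crossvalidation ;
         limo:outcome          ?outcome .
  ?response  limo:usageType eg:response ;
             limo:theme     eg:ILIphysvisits .
  ?predictor limo:usageType eg:predictor ;
             limo:theme     eg:ILIrandquery .
}
ORDER BY DESC(?outcome)   # assumes higher is better (e.g. correlation)
LIMIT 1</preformat>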
        <p>We should note that in this preliminary version of the vocabulary the
execution of the model is possible only through a PMML XML file. In the next version
we aim at providing a more detailed description of the model in order to enable
the execution of a model through its limo description. Full documentation of the
limo vocabulary is available online<sup>4</sup>.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Using limo</title>
        <p>In this section we present how the limo vocabulary can be used (a) to
describe a predictive model and (b) to enable the discovery of predictive models
that meet given requirements.</p>
        <p>
          Below we present the limo description of the predictive model developed by
Ginsberg et al. and presented in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This model aims at predicting influenza-like
illness (ILI) physician visits from ILI-related queries. The model employs
a linear regression method as well as data from Google and the US Centers for
Disease Control and Prevention. The data cover nine regions of the United
States between 2003 and 2008. The model was assessed using cross-validation
against out-of-sample data partitions, obtaining a mean correlation of
0.97.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 http://purl.org/limo-ontology/limo</title>
      <p>Fig. 1. The Linked Statistical Models vocabulary. [Figure not recoverable from the extracted text; surviving labels: qb:DataSet, limo:data, limo:evaluationData, limo:validationData, limo:trainingData, dctype:Dataset, limo:rawData, limo:baseline, limo:publishedIn, dct:BibliographicResource, limo:file, limo:File, limo:accessURL, limo:power.]</p>
      <p>
        Description of the predictive model presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with limo
      </p>
      <preformat>eg:CDCILImodel a limo:Model ;
    dct:title "CDC-ILI model"@en ;
    limo:spatial [ rdf:type dbpedia:United_States ] ;
    limo:temporal [
        a dct:PeriodOfTime ;
        limo:startDate "2003-09-28"^^xsd:date ;
        limo:endDate "2008-05-11"^^xsd:date
    ] ;
    limo:modelType eg:regression ;
    limo:variable eg:resp ;
    limo:variable eg:pred ;
    limo:method eg:linearregression ;
    limo:power eg:CDCILIpower ;
    limo:file eg:CDCILIfile ;
    limo:rawData eg:CDCILIdataset ;
    limo:evaluationData eg:CDCILIevaluationdata ;
    limo:validationData eg:CDCILIvalidationdata ;
    limo:trainingData eg:CDCILItrainingdata ;
    dct:creator eg:ginsberg, eg:mohebbi, eg:patel, eg:brammer,
                eg:smolinski, eg:brilliant .

eg:resp a limo:Variable ;
    limo:variableType eg:continuous ;
    dct:description "Percentage of physician visits in which a patient presents with influenza-like symptoms in a region"@en ;
    limo:usageType eg:response ;
    limo:theme eg:ILIphysvisits .

eg:pred a limo:Variable ;
    limo:variableType eg:continuous ;
    dct:description "Probability that a random search query submitted from a region is ILI-related"@en ;
    limo:usageType eg:predictor ;
    limo:theme eg:ILIrandquery .

eg:CDCILIpower a limo:Power ;
    limo:evaluationMethod eg:crossvalidation ;
    limo:outcome 0.97 .

eg:CDCILIdataset a dctype:Dataset ;
    dct:resource &lt;http://www.cdc.gov/flu/weekly&gt; .</preformat>
      <p>In addition, limo will enable queries across distributed
descriptions of predictive models. For example, below we present a query answering
the question "How many models exist that show a relationship between the
percentage of influenza-related physician visits and the probability that a random
search query submitted from a region is influenza-related?".</p>
      <sec id="sec-3-1">
        <title>A query for identifying models that predict influenza-like illnesses from search query data</title>
        <preformat>SELECT (COUNT(?model) AS ?nmodels)
WHERE {
  {
    ?model limo:variable ?variable1 ;
           limo:variable ?variable2 .
    ?variable1 limo:usageType eg:response ;
               limo:theme eg:ILIphysvisits .
    ?variable2 limo:usageType eg:predictor ;
               limo:theme eg:ILIrandquery .
  }
  UNION
  {
    ?model limo:variable ?variable1 ;
           limo:variable ?variable2 .
    ?variable1 limo:usageType eg:predictor ;
               limo:theme eg:ILIphysvisits .
    ?variable2 limo:usageType eg:response ;
               limo:theme eg:ILIrandquery .
  }
}</preformat>
        <p>Moreover, a query based on limo could reveal the variables that are predictors
of influenza-related physician visits in empirical models constructed from
data regarding the U.S. The identification of these variables could enhance the
process of building predictive models for influenza-like illnesses.</p>
      </sec>
      <sec id="sec-3-2">
        <title>A query for identifying predictors of influenza-like illnesses</title>
        <preformat>SELECT ?variable1
WHERE {
  ?model limo:variable ?variable1 ;
         limo:variable ?variable2 ;
         limo:spatial ?sp1 .
  ?variable1 limo:usageType eg:predictor .
  ?variable2 limo:usageType eg:response ;
             limo:theme eg:ILIphysvisits .
  ?sp1 rdf:type dbpedia:United_States .
}</preformat>
        <sec id="sec-3-2-1">
          <title>Conclusions</title>
          <p>Predictive analytics refers to the process of building a model that enables the
prediction of new observations using data and statistical or data mining methods.
Predictive models are very important in business, academia and government
as they can predict values such as sales and identify patterns regarding, e.g.,
profitable customers or the behavior of citizens. These models can be
reused across platforms and in different cases. Although standards for
transferring models across different platforms have been proposed, at the moment it is
difficult to discover an appropriate model for a task at hand at Web scale.</p>
          <p>In this paper we suggested that descriptions of predictive models should
be incorporated into the Linked Data Web and we proposed an RDF Schema
vocabulary towards this end. We described the main classes of the vocabulary
and we presented an example of how the vocabulary can be used in order to
describe a predictive model. We also demonstrated how the vocabulary can be
used in order to facilitate the discovery of predictive models.</p>
          <p>We believe that the adoption of the vocabulary could create new potentials
beyond cross-platform reuse of models. In particular, the vocabulary will enable
(a) easy discovery and reuse of appropriate models at a Web Scale and (b)
creation of more accurate models exploiting connections of models to other models,
datasets and other resources on the Web.</p>
          <p>Future work includes further evaluation of the vocabulary by describing a
larger number of predictive models and by incorporating a linked data set into
the linked data cloud. This will enable the execution of more complex queries
and the evaluation of the vocabulary in real world settings. In addition, the
possibility of extending limo with execution capabilities will be considered. This
includes enriching limo with classes and attributes that will allow for importing
RDF data into popular open platforms such as R and executing the actual model.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Acknowledgments</title>
          <p>The work presented in this paper was partially carried out in the course of the
Linked2Safety<sup>5</sup> project, which is funded by the European Commission within
the 7th Framework Programme under grant agreement No. 288328.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 http://www.linked2safety-project.eu/</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Shmueli</surname>
          </string-name>
          , G.: To Explain or to Predict? Statistical Science,
          <volume>25</volume>
          (
          <issue>3</issue>
          ),
          <fpage>289</fpage>
          –
          <lpage>310</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Antweiler</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>M.Z.</given-names>
          </string-name>
          :
          <article-title>Is all that talk just noise? the information content of internet stock message boards</article-title>
          .
          <source>Journal of Finance</source>
          ,
          <volume>59</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1259</fpage>
          –
          <lpage>1294</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mishne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glance</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Predicting Movie Sales from Blogger Sentiment</article-title>
          . In:
          <source>American Association for Artificial Intelligence 2006 Spring Symposium on Computational Approaches to Analysing Weblogs</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ginsberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohebbi</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brammer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smolinski</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brilliant</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Detecting influenza epidemics using search engine query data</article-title>
          .
          <source>Nature</source>
          ,
          <volume>457</volume>
          (
          <issue>7232</issue>
          ),
          <fpage>1012</fpage>
          –
          <lpage>1014</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ghose</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipeirotis</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          :
          <article-title>Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>23</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1498</fpage>
          –
          <lpage>1512</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Nowcasting Events from the Social Web with Statistical Learning</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazzucco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>DataSpace: a data Web for the exploratory analysis and mining of data</article-title>
          .
          <source>Computing in Science and Engineering</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ),
          <fpage>44</fpage>
          –
          <lpage>51</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>R. L.</given-names>
          </string-name>
          ,
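          <string-name>
            <surname>Hornick</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meyer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :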
          <article-title>Data mining standards initiatives</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>45</volume>
          (
          <issue>8</issue>
          ),
          <fpage>59</fpage>
          –
          <lpage>61</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Linked Open Government Data Analytics</article-title>
          . In:
          <string-name>
            <surname>Wimmer</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholl</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          (eds.)
          <source>EGOV 2013</source>
          .
          <series>LNCS</series>
          , vol.
          <volume>8074</volume>
          , pp.
          <fpage>99</fpage>
          –
          <lpage>110</lpage>
          . IFIP International Federation for Information Processing (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Generating Possible Interpretations for Statistics from Linked Open Data</article-title>
          . In:
          <string-name>
            <surname>Simperl</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Presutti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (eds.)
          <source>ESWC 2012</source>
          .
          <series>LNCS</series>
          , vol.
          <volume>7295</volume>
          , pp.
          <fpage>560</fpage>
          –
          <lpage>574</lpage>
          . Springer, Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Chiricos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Rates of Crime and Unemployment: An Analysis of Aggregate Research Evidence</article-title>
          .
          <source>Social Problems</source>
          ,
          <volume>34</volume>
          ,
          <fpage>187</fpage>
          –
          <lpage>212</lpage>
          (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kalampokis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tambouris</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarabanis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Understanding the Predictive Power of Social Media</article-title>
          .
          <source>Internet Research</source>
          ,
          <volume>23</volume>
          (
          <issue>5</issue>
          ) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wettschereck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Exchanging Data Mining Models with the Predictive Modelling Markup Language</article-title>
          . In:
          <source>International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pechter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>What's PMML and What's New in PMML 4.0?</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>19</fpage>
          –
          <lpage>25</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gregory</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wackerow</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>DDI-RDF Discovery Vocabulary: A Metadata Vocabulary for Documenting Research and Survey Data</article-title>
          . In:
          <source>LDOW2013</source>
          , May 14, 2013, Rio de Janeiro, Brazil
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. W3C:
          <source>The RDF Data Cube Vocabulary</source>
          . W3C Working Draft (
          <year>2013</year>
          ), http://www.w3.org/TR/vocab-data-cube/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga Ngomo</surname>
            ,
            <given-names>A.-C.</given-names>
          </string-name>
          :
          <article-title>Linked SDMX Data: Path to high fidelity Statistical Linked Data for OECD, BFS, FAO, and ECB</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <source>Statistical Linked Dataspaces</source>
          . Master's thesis, National University of Ireland (
          <year>2012</year>
          ), http://csarven.ca/statistical-linked-dataspaces
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Salas</surname>
            ,
            <given-names>P. E. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mota</surname>
            ,
            <given-names>F. M. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>Publishing Statistical Data on the Web</article-title>
          . In:
          <source>IEEE Sixth International Conference on Semantic Computing (ICSC)</source>
          , pp.
          <fpage>285</fpage>
          –
          <lpage>292</lpage>
          . IEEE Press, New York (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Shmueli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppius</surname>
            ,
            <given-names>O.R.</given-names>
          </string-name>
          :
          <article-title>Predictive Analytics in Information Systems Research</article-title>
          .
          <source>MIS Quarterly</source>
          ,
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>553</fpage>
          –
          <lpage>572</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>