<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ER Forum, Demo and Posters</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Impact of Data Quality in Real-Time Big Data Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Merino</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiang Xie</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ajith Kumar Parlikad</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Lewis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duncan McFarlane</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Technology, University of Cambridge</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>73</volume>
      <fpage>73</fpage>
      <lpage>86</lpage>
      <abstract>
        <p>Data Quality is one of the main challenges in any type of Big Data System. Timeliness is one of the main factors in real-time Big Data. Limiting data quality evaluations to data sources may be insufficient in Big Data Systems with high Velocity and Variability. On the other hand, real-time Data Quality evaluations throughout the Big Data Pipeline can be costly (i.e., latency introduced by Data Quality Evaluations). This paper identifies four categories of approaches for Big Data Quality Evaluation available in the literature: embedded, parallel, in-line, and independent. A real-time Big Data System based on the SmartCambridge Real-Time Data Platform is deployed and used as a basis to implement a representative case for each one of the four categories identified. An application for dynamic bus catching prediction is used as a case study to quantify the impact of these Data Quality Evaluations in the Real-Time Data Platform in terms of latency introduced in the system. Results suggest that the impact of Data Quality Evaluations differs depending on the type of method used, and that the main factors are the data transfers between Data Quality modules and the data processing algorithms, the synchronisation of messages, and the complexity of the Data Quality algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Real-Time</kwd>
        <kwd>Timeliness</kwd>
        <kwd>Internet of Things</kwd>
        <kwd>Smart Cities</kwd>
        <kwd>Velocity</kwd>
        <kwd>Latency</kwd>
        <kwd>Streaming</kwd>
        <kwd>SmartCambridge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <label>1</label>
      <title>Introduction</title>
      <p>
        Smart cities provide intelligent services over six axes (i.e., economy, mobility,
environment, people, living, and governance) to improve the quality of life of citizens [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Social networks, meteorology, and IoT are often used as sources of real-time data for
smart commuting (e.g., traffic, public transport) and pollution analysis (e.g., quality of air
and water, noise levels), among others. Most smart city applications have to deal
with high-volume, high-velocity, and/or high-variety data [
        <xref ref-type="bibr" rid="ref2 ref27">2</xref>
        ].
      </p>
      <p>Real-Time Big Data systems collect and process data from different streaming
sources like sensors, smart tags, networks, and other systems (e.g., meteorological,
social networks, etc.) while minimising the latency. Data is often integrated with
non-real-time data of spaces, assets, products, and processes to make timely and informed
decisions.</p>
      <p>
        Each one of these sources has different data quality levels depending on factors
like errors in the readings, granularity in time and space of the data, quality of
communication networks, and data processing capability [
        <xref ref-type="bibr" rid="ref2 ref27">2</xref>
        ]. All these factors affect
Timeliness, which can be defined as the minimum acceptable latency in a
decision-making process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Timeliness is one of the main factors that affect the decision
quality in Real-Time Big Data systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Timeliness is also one of the dimensions
that define Data Quality, commonly defined as “fit-for-use” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. If a decision is made
on outdated data, the data is considered poor quality and the decision will likely be suboptimal [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
[
        <xref ref-type="bibr" rid="ref2 ref27">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]–[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        A myriad of Big Data Quality Evaluation alternatives coexist in the literature.
Many conduct the evaluation of Data Quality on a subset or a sample of the data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]–
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which can help reduce the time needed to evaluate Data Quality but may reduce the
significance and trustworthiness of the evaluation. A number of approaches suggest evaluating
data quality separately from the main Big Data pipeline in order to avoid impacting
the latency [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]–[16]. However, it has been proven that data quality errors can be
introduced at different points in the analytics pipeline [17]. A Big Data analysis may
perceive data sources as trusted if these data sources scored good quality levels in the
past, but this trust may be outdated. Introducing data quality evaluation steps (i.e.,
real-time evaluation) may enhance timeliness, and thus decision-making, but may
also incur a cost every time the pipeline is executed. The extra cost comes in the form of
a) additional latency in every new evaluation step added to the system, b)
additional memory load to temporarily store data for its quality evaluation, c)
concurrent computation demand to evaluate data quality at the same time as the data is being processed
by Big Data analyses, or d) higher complexity of Big Data analyses, which hinders
maintenance.
      </p>
      <p>This paper introduces a classification of Big Data Quality Evaluation
approaches into four categories: Embedded, Parallel, In-line, and Independent. One approach of each
type is implemented in a Real-Time Big Data System
(Intelligent City Platform [18]). The implementations are tested using a smart city
application for bus catching prediction as a case study. The latency introduced by
each implementation is quantified to measure the impact of the different Big Data
Quality evaluation categories. The results of each category are compared, and the
benefits and drawbacks of each alternative are analysed.</p>
      <p>Section 2 contextualizes the concept of Data Quality for Big Data and extracts four
categories of methods for Big Data Quality evaluation from the literature. A research
method based on latency quantification and benchmarking is presented in section 3.
Section 4 describes the framework and the case study used to benchmark the different
categories of Big Data Quality evaluation methods. The results of the benchmark are
shown in section 5. Section 6 provides conclusions.
</p>
    </sec>
    <sec id="sec-2">
      <label>2</label>
      <title>Data Quality Evaluation in Big Data</title>
      <p>Gartner defines Big Data as “high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making” [19]. Some authors have added Value
and Veracity to the set [20], [21]. Big Data is often managed through Big Data
Pipelines (Collection, Preparation/Curation, Analysis, Visualisation, and Access) [22].
[23] differentiates between data quality (i.e., assessed at the time data is collected)
and information quality (i.e., assessed at the time data is being used).</p>
      <p>
        [17] proved that Data Quality is reduced at the same rate as the Volume
increases, and that Variety exponentially reduces the Consistency of data. [17] also
identified that many errors are introduced during the traditional Big Data Pipeline and that not
all data defects can be filtered. Thus, it highlights the compromise between filtering
data defects and using a quality-based trust factor for the Big Data Analysis in
a much-needed multiphase data quality evaluation. Most approaches conduct the
evaluation of Big Data Quality during the Data Collection phase [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]–[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [24] or
during the Data Preparation/Curation phase [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [16], [25]–[27]. For instance, data
miners focus on accuracy and outlier removal as the most important factors that
define data quality [28], [29]. [30] measures the quality of integration of different data
schemas, which affects the quality of the data used in the Big Data Analysis phase. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
proposes a framework to evaluate quality of Big Data after each phase of the
traditional Big Data Pipeline and acknowledges that different Data Quality dimensions
apply to some/all phases of the Big Data Pipeline.
      </p>
      <p>
        [31] proposes an ontology-based data quality measurement and monitoring
framework for data streams, including content (i.e., semantics of the data flowing in the
stream), queries (i.e., aggregation and integration), and application (i.e.,
context-dependent quality requirements). [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] analyses the quality measures for image-based
crowd-sourced big data, emphasising the time-related dimensions over the rest in
real-time Big Data systems. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] centres its attention on the timeliness of the Big Data
Analysis, providing a method to measure the freshness of data. Some of these authors
emphasize the trade-off between Data Quality and performance for real-time data in
their approaches. [31] affirms that while performance must ensure real-time
processing, sufficient quality of computed results must be achieved. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] suggest
boosting performance by sampling the available data for quality evaluation, and then
using regression to extrapolate results. Sampling may help to reduce the overall latency
of the Big Data Pipeline and maintain performance, but control mechanisms like sampling
must be tuned to preserve the balance between performance and sufficient data
quality evaluation [31].
      </p>
      <p>2.1 Classification of methods for Big Data Quality evaluations</p>
      <p>This section introduces one of the contributions of this paper: a classification of the
methods available in the literature for Big Data Quality evaluation. Considering the
above, four categories of methods have been identified:</p>
      <p>1. Embedded: These methods evaluate quality during the Data Processing step by
incorporating quality constraints into the Big Data Analysis. This allows tighter
connections to the business applications (e.g., most Machine Learning regression
algorithms include outlier removal) at the cost of higher complexity in maintenance of
the system and lower flexibility. These methods are used when adequate levels of
Data Quality are essential to either filter out defective data or annotate it for further
analysis.</p>
      <p>2. Parallel: In this type of method, the evaluation is conducted concurrently in a
separate data flow from the Big Data Analysis. Performance is often the main driver
of the Big Data Analysis (e.g., high-Velocity or high-Volume data that must be
rapidly processed), but quality also plays a key role. Hence, quality evaluation
must not introduce any latency in the main data processing flow. Quality
evaluation results are used as risk indicators (i.e., analysis metadata) that can be checked
after the Big Data Analysis is finished (e.g., when the Data Quality levels are not
sufficient, the risk of using the results of a Big Data Analysis on the same data to
make a decision is high). These methods are commonly used to monitor Data
Quality in Data Stream applications.</p>
      <p>3. In-line: These methods evaluate data quality before the Big Data Analysis, but in
the same data flow. These scenarios are more commonly known as data curation,
and they prioritise the identification and correction or filtering of data defects.
They are used in Big Data systems that support vital decision-making, where performance is
secondary. These methods include data profiling, assurance, lineage, data tagging,
and filtering.</p>
      <p>4. Independent: These methods conduct the evaluation in an unconnected data flow
and usually before the main Big Data Pipeline is even executed. It is the least
disruptive approach, as the evaluation is conducted independently from the Big Data
Analysis. The main drawback is that these approaches are not designed to evaluate
the data on-the-go as it flows throughout the Data Pipeline. Hence, the Data
Quality levels do not normally consider the message being processed in real-time by the
Big Data Analysis, but rather historical data. The results of the data quality
evaluations are often stored as metadata of the data sources, or in a metadata repository
when evaluations are more granular and/or come from different sources. This
category includes methods for data lineage, data source evaluation, data provenance,
and data cataloguing.</p>
      <p>Table 1 classifies the Big Data Evaluation methods in the literature according to
these 4 categories (some methods can be classified into more than one category).</p>
      <p>This paper uses the term “real-time” to refer to an unobstructed flow of incoming data
through a Big Data system. We use a publish-subscribe Real-Time system that
manages fast-paced data (also referred to as Data Streams) and prioritises the minimization of
end-to-end latency (i.e., Velocity). It is important to highlight that in this system,
message passing is handled asynchronously. This means that the messages are not
directly available in the memory of the subscriber when a module publishes them, but only
when the subscriber has fully received them.</p>
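      <p>As a minimal sketch of this asynchronous behaviour (not the Java and Vert.x code of the platform), the following Python example uses an asyncio queue as a stand-in for one EventBus topic. The names publisher, subscriber, and transfer_delay_s, the message fields, and the simulated 10 ms transfer are all assumptions made for illustration.</p>
      <preformat>
```python
import asyncio
import time


async def publisher(bus: asyncio.Queue, payload: dict) -> float:
    """Publish one message and return the publish timestamp."""
    published_at = time.perf_counter()
    await bus.put({"payload": payload, "published_at": published_at})
    return published_at


async def subscriber(bus: asyncio.Queue, transfer_delay_s: float) -> dict:
    """Receive one message; it is usable only once the transfer has completed."""
    message = await bus.get()
    await asyncio.sleep(transfer_delay_s)  # simulated transfer time (assumption)
    message["received_at"] = time.perf_counter()
    return message


async def demo() -> dict:
    bus = asyncio.Queue()  # hypothetical stand-in for one EventBus topic
    receive_task = asyncio.create_task(subscriber(bus, transfer_delay_s=0.01))
    await publisher(bus, {"vehicle_id": "bus-1"})
    return await receive_task


message = asyncio.run(demo())
transfer_s = message["received_at"] - message["published_at"]
print(f"message usable {transfer_s * 1000:.1f} ms after publication")
```
</preformat>
      <p>The gap between the two timestamps illustrates why the subscriber cannot assume that a message is in its memory at the moment of publication.</p>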
      <p>The impact of Data Quality in this type of Big Data Systems will be analysed by
measuring and comparing the latency introduced by Embedded, Parallel, In-Line, and
Independent methods for Big Data Quality Evaluation. Not all the alternatives from
the literature are benchmarked; rather, a custom implementation applicable in the
selected framework (see section 4) represents each one of the identified categories.
The impact is always quantified in terms of latency introduced in the normal flow of
data, but it varies depending on the addressed category. Fig. 1 visually shows the
formulas to calculate the latency.</p>
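      <p>Given a data quality method DQmethod that evaluates the quality of a message m,
and a data processing task DP that processes the message m:</p>
      <p>Embedded: <inline-formula><tex-math>\mathit{latency} = t_{\mathrm{end}}(\mathit{DQmethod}) - t_{\mathrm{arr}}(m, \mathit{DP})</tex-math></inline-formula>, with <inline-formula><tex-math>t_{\mathrm{end}}(\mathit{DQmethod})</tex-math></inline-formula>
being the time when the algorithm of the DQmethod finishes, and <inline-formula><tex-math>t_{\mathrm{arr}}(m, \mathit{DP})</tex-math></inline-formula>
being the time when the message m arrives at the data processor DP. As the
DQmethod is embedded into the DP, the data transfer does not introduce latency.
Hence, the latency is always a positive number, <inline-formula><tex-math>\mathit{latency} \in (0, \infty)</tex-math></inline-formula>.</p>
      <p>Parallel: In this approach, the data processing tasks are not directly impacted by
the data quality evaluation algorithms as they run concurrently. Both may impact
each other in the cases when they share infrastructure. Notwithstanding, it is
necessary to introduce a MsgIntegrator MI to integrate the processed data and the data
quality evaluation results. Consequently, <inline-formula><tex-math>\mathit{latency} = t_{\mathrm{arr}}(m_{DQ}, \mathit{MI}) - t_{\mathrm{arr}}(m_{DP}, \mathit{MI})</tex-math></inline-formula>, with <inline-formula><tex-math>t_{\mathrm{arr}}(m_{DQ}, \mathit{MI})</tex-math></inline-formula>
being the time when the message mDQ from
DQmethod arrives at the MsgIntegrator, and <inline-formula><tex-math>t_{\mathrm{arr}}(m_{DP}, \mathit{MI})</tex-math></inline-formula>
being the time when
the message mDP from the data processor DP arrives at the MsgIntegrator MI.
Thereby, the latency could be negative if mDQ arrives before mDP, <inline-formula><tex-math>\mathit{latency} \in (-\infty, \infty)</tex-math></inline-formula>.</p>
      <p>In-line: <inline-formula><tex-math>\mathit{latency} = t_{\mathrm{arr}}(m, \mathit{DP}) - t_{\mathrm{arr}}(m, \mathit{DQmethod})</tex-math></inline-formula>, with <inline-formula><tex-math>t_{\mathrm{arr}}(m, \mathit{DP})</tex-math></inline-formula>
being the time when the message m arrives at the data processor DP and <inline-formula><tex-math>t_{\mathrm{arr}}(m, \mathit{DQmethod})</tex-math></inline-formula>
being the time when the message m arrives at the DQmethod. The DQmethod is executed
before the DataProcessor. As a result, the latency includes the duration of the
DQmethod itself plus data transfers. The latency is a positive number, <inline-formula><tex-math>\mathit{latency} \in (0, \infty)</tex-math></inline-formula>.</p>
      <p>Independent: These alternatives run the DQmethod independently in a different
timeline (time0) and store the result in a DQ metadata repository. DQ levels are
used during the normal flow of data (time1). Thus, the latency introduced depends
only on the duration of loading the data quality levels from the data quality
metadata repository. Then, the latency is defined as <inline-formula><tex-math>\mathit{latency} = t_{\mathrm{load}}(\mathit{DQ}) - t_{\mathrm{arr}}(m, \mathit{DP})</tex-math></inline-formula>, with <inline-formula><tex-math>t_{\mathrm{load}}(\mathit{DQ})</tex-math></inline-formula>
being the time when data quality levels have
been loaded from the data quality metadata repository, and <inline-formula><tex-math>t_{\mathrm{arr}}(m, \mathit{DP})</tex-math></inline-formula>
being the time when the message m arrives at the data processor DP. Hence, the latency is a
positive number, <inline-formula><tex-math>\mathit{latency} \in (0, \infty)</tex-math></inline-formula>.</p>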
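      <p>The four latency definitions above can be sketched in Python as simple timestamp differences. This is an illustrative sketch only: the Timestamps class, its field names, and the millisecond values in the usage example are assumptions, not measurements or code from the platform.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamps:
    """Timestamps in milliseconds for one message m (field names are illustrative)."""
    m_arrives_dp: float = 0.0      # m arrives at the data processor DP
    m_arrives_dq: float = 0.0      # m arrives at the DQmethod
    dq_method_ends: float = 0.0    # the DQmethod algorithm finishes
    m_dq_arrives_mi: float = 0.0   # mDQ arrives at the MsgIntegrator MI
    m_dp_arrives_mi: float = 0.0   # mDP arrives at the MsgIntegrator MI
    dq_levels_loaded: float = 0.0  # DQ levels loaded from the metadata repository


def embedded_latency(t: Timestamps) -> float:
    """End of the DQmethod minus arrival of m at DP."""
    return t.dq_method_ends - t.m_arrives_dp


def parallel_latency(t: Timestamps) -> float:
    """Arrival of mDQ at MI minus arrival of mDP at MI (may be negative)."""
    return t.m_dq_arrives_mi - t.m_dp_arrives_mi


def inline_latency(t: Timestamps) -> float:
    """Arrival of m at DP minus arrival of m at the DQmethod."""
    return t.m_arrives_dp - t.m_arrives_dq


def independent_latency(t: Timestamps) -> float:
    """DQ levels loaded minus arrival of m at DP."""
    return t.dq_levels_loaded - t.m_arrives_dp


# Made-up timestamps, only to exercise the four functions.
example = Timestamps(
    m_arrives_dp=105.0,
    m_arrives_dq=100.0,
    dq_method_ends=108.0,
    m_dq_arrives_mi=111.0,
    m_dp_arrives_mi=113.0,
    dq_levels_loaded=107.0,
)
print(embedded_latency(example), parallel_latency(example),
      inline_latency(example), independent_latency(example))
```
</preformat>
      <p>With these made-up values the Parallel latency is negative, because mDQ reaches the MsgIntegrator before mDP, while the other three are positive.</p>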
      <p>The formulas have been defined with the objective of quantifying the entire latency
introduced by each DQ approach. This means that the formulas include not only the
time to run the DQmethod algorithm itself, but also the time to transfer the messages,
load the data quality metadata from repositories, or synchronise messages for
integration. In this manner, the latencies can be compared between different
categories. The implementation of these formulas is described in the following section,
alongside the Big Data System that serves as a framework for the
implementation and the case study used for the benchmark.</p>
      <p>4 Framework and Case Study – Smart Cambridge iCP</p>
      <p>The Smart Cambridge Intelligent City Platform (iCP) gathers data from sensors
installed around the Cambridge region. A Data Hub is used to ingest, process, visualise,
and provide access to collected data in real-time. The main purpose of the iCP is to
explore the future of transport around Cambridge, including how traffic congestion
can be reduced and air quality improved. Implemented applications include car park
surveillance and status (cameras and barrier counters), traffic status (traffic lights,
traffic monitoring cameras), public transport (bus GPS, train schedules), and air
quality (air quality sensors, including CO2). A new LoRa (Low Power Long Range)
network has also been established in collaboration with the University of Cambridge
to transfer the data flowing in from multi-purpose sensors to the Data Hub. One of the
main drivers of the iCP is data availability and visualisation to enable the creation of a
city-level digital twin. This information is visualised in smart panels in some
buildings at the University of Cambridge and is publicly available at [34].</p>
      <p>The architecture of the Data Hub proposed by [18] is implemented as a framework
to benchmark the latency introduced by the types of Big Data Quality
Evaluation approaches identified in section 2.1. The core architecture [35] of the Data Hub is
depicted in Fig. 2. The design of the architecture is based on the real-time publish-subscribe
model and it is implemented using Java and Vert.x [36]. Latency from message
input to output is in the range of milliseconds. The main modules in this architecture
are described below:</p>
      <p>EventBus: serves as a route of push communication between the different modules.
It has different topics that the modules can subscribe and publish to. A topic is an
address in the EventBus (e.g., buses/sensors/ is a topic for all the sensors in the
buses).</p>
      <p>FeedHandlers: subscribe to different data sources (e.g., vehicle position data) and
publish the data on the EventBus. In some cases, incoming data arrives in
a different format. Thus, the FeedHandlers can parse incoming data into a format
that the platform can easily manage and understand (i.e., JSON). A FeedHandler also
archives every post of binary data as a timestamped file in the server filesystem.</p>
      <p>FeedMakers: Often, it is not possible to subscribe to data sources from legacy
systems. In most cases these systems store data in relational databases, XML, Excel
files, etc. In those cases, a FeedMaker queries data from those sources
periodically (typically at intervals of minutes), stores it raw in the file system, and parses and publishes it as a
message on the EventBus.</p>
      <p>MsgFilers: are general-purpose modules that subscribe to messages on the
EventBus and persist them in the filesystem. FeedCSV is a specific type of MsgFiler that
stores the bus position data in CSV format.</p>
      <p>4.1 Data Quality Evaluation in the Real-Time Data Platform</p>
      <p>Two new modules called DQAnalyser and MsgIntegrator were developed to
enable Data Quality Evaluation in the Real-Time Data Platform, together with an additional
module named DataProcessor that is responsible for the data processing tasks (i.e., Big Data
Analysis). These modules are used to implement the formulas to calculate the latency
introduced by data quality. Fig. 3 shows how these modules implement those
formulas for each one of the four categories.</p>
      <p>DQAnalyser: analyses data quality of messages flowing in a particular topic in the
EventBus. DQAnalyser can be configured to either attach the data quality score to
the message or store it as metadata in a data-quality-metadata repository.
DQAnalyser can persist the messages with the Data Quality score in the file system and
publish them back to the EventBus. DQAnalyser creates DQIndicators to configure the
algorithm for the DQmethod that calculates the data quality score of the messages.
Thus, new algorithms can be easily incorporated.</p>
      <p>MsgIntegrator: acts like a funnel, reading messages from multiple topics in the
EventBus and publishing them back to a common topic. MsgIntegrator can also
synchronise the messages from the subscribed topics and integrate them.</p>
      <p>DataProcessor: dedicated to reading messages from one or multiple addresses in
the EventBus and from the filesystem and conducting processing tasks on them. The
DataProcessor can publish the processed messages back to the EventBus or store
them in the filesystem. DataProcessors can be configured to be data quality aware,
in which case they read from a data-quality-metadata repository. DataProcessors
can also call DQIndicators that implement DQmethod algorithms.</p>
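      <p>The synchronising role of the MsgIntegrator described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the platform's Java module: the class MsgIntegratorSketch, the "dp" and "dq" source labels, the message identifier, and the output field names are all hypothetical.</p>
      <preformat>
```python
from typing import Optional


class MsgIntegratorSketch:
    """Pairs the DataProcessor output (mDP) with the DQAnalyser output (mDQ)."""

    def __init__(self) -> None:
        # message id -> halves received so far, keyed by "dp" or "dq"
        self._pending = {}

    def on_message(self, source: str, msg_id: str, body: dict) -> Optional[dict]:
        """Store one half; return the integrated message once both halves arrived."""
        if source not in ("dp", "dq"):
            raise ValueError("source must be 'dp' or 'dq'")
        halves = self._pending.setdefault(msg_id, {})
        halves[source] = body
        if len(halves) != 2:
            return None  # still waiting for the other module
        del self._pending[msg_id]
        return {"id": msg_id, "result": halves["dp"], "quality": halves["dq"]}


integrator = MsgIntegratorSketch()
first = integrator.on_message("dq", "m1", {"timeliness": 0.9})
second = integrator.on_message("dp", "m1", {"catch_bus": True})
print(first)
print(second)
```
</preformat>
      <p>The integrator emits nothing until the slower of the two messages arrives, which is the waiting time that the Parallel latency formula captures.</p>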
      <p>A bus catching prediction application is used as a case study to generate data on the
latency introduced by each one of the four types of alternatives for Big Data Quality
(Embedded, Parallel, In-line, Independent). The application reads the user position
and predicts whether the user can catch the next bus to arrive at the desired bus stop.
It uses bus geolocation, route, and schedule data from the iCP platform,
plus the geolocation of the user. It then calculates the likelihood of the user arriving
at the bus stop before the next bus and suggests that the user either take the next bus or
wait for the following one. Fig. 4 shows the general data flow and decision-making
process of the application. Given that the application itself is not the main goal of the
case study, the decision has been simplified to facilitate understanding.</p>
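      <p>A heavily simplified version of this decision can be sketched in Python. The sketch replaces the likelihood calculation with a deterministic comparison of arrival times, and every name and number in it (can_catch_next_bus, the walking speed, the 30-second safety margin) is an assumption for illustration rather than part of the application.</p>
      <preformat>
```python
def can_catch_next_bus(user_distance_m: float, walking_speed_mps: float,
                       bus_eta_s: float, margin_s: float = 30.0) -> bool:
    """Return True when the user should reach the stop before the next bus."""
    if not walking_speed_mps > 0:
        raise ValueError("walking speed must be positive")
    user_eta_s = user_distance_m / walking_speed_mps
    # Require the user to arrive at least margin_s seconds before the bus.
    return bus_eta_s - user_eta_s >= margin_s


def suggestion(user_distance_m: float, walking_speed_mps: float,
               bus_eta_s: float) -> str:
    """Translate the prediction into the advice given to the user."""
    if can_catch_next_bus(user_distance_m, walking_speed_mps, bus_eta_s):
        return "take the next bus"
    return "wait for the following bus"


# 250 m at 1.25 m/s takes 200 s; the bus is 300 s or 210 s away.
print(suggestion(250.0, 1.25, 300.0))
print(suggestion(250.0, 1.25, 210.0))
```
</preformat>
      <p>In the first call the user arrives 100 seconds before the bus and is advised to take it; in the second the 10-second gap is below the assumed margin, so the advice is to wait.</p>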
      <p>Some data quality concerns arise from the decision-making process of this
application. As it uses asynchronously collected real-time data, there is always a delay
between the last geolocation position submitted by a given bus and its real geolocation
position. This is an example of Timeliness of data, as it creates a time window in
which it is safe to make the decision using the last bus position reading. Given that the
purpose of this case study is to quantify latency, the DQ evaluation has been
simplified.
</p>
      <p>5 Analysis of the Impact of Data Quality in Real-Time Big Data</p>
      <p>1000 messages from buses were ingested to analyse the impact (i.e., in terms of
latency) of the different categories for Big Data Quality Evaluation: Embedded, Parallel,
In-Line, and Independent. Table 2 compares the latency introduced by these four
categories in the system. Fig. 5 visualises the latency introduced while processing the
1000 messages in chronological order and its distribution. These experimental results
were obtained by taking timestamps for each method as defined in section 3.</p>
      <p>Embedded: The DQIndicator modules are called from the DataProcessor. DQIndicator modules are responsible for
calling the right DQmethod (for this case study, the Timeliness evaluator; see
sections 4.1 and 4.2). In this implementation, the calls from the DataProcessor and
the DQAnalyser also follow the publish-subscribe paradigm, as all the modules
work asynchronously, avoiding waits and locks. Hence, the latency can be defined
as a function of the complexity of the DQmethod used for the evaluation plus a
term that depends on the framework.</p>
      <p>Parallel: The implementation of this approach in this paper includes a
MsgIntegrator responsible for collecting, synchronising, and integrating the messages mDP and
mDQ from the DataProcessor and the DQAnalyser respectively (see section 4.1).
Therefore, the latency can be defined as a function of the synchronisation phase, as
the MsgIntegrator needs to wait for the last message to arrive.</p>
      <p>In-Line: Similarly to embedded alternatives, the DQAnalyser needs to evaluate the
quality of the message m before it enters the DataProcessor (see sections 3 and
4.1). Thereby, the latency can be defined as a function of the data transfers plus the
complexity of the DQmethod used for the evaluation.</p>
      <p>Independent: In the implementation of this approach for the case study, the
DataProcessor queries a metadata repository with the data quality information on
the topics in the EventBus (see sections 3 and 4.1). Consequently, the latency can
be defined as a function of the data transfer (i.e., data querying from the metadata
repository).</p>
      <p>None of the approaches for Big Data Quality
Evaluation seems to be best in all cases. Generally, for Embedded and In-line
approaches, the latency introduced in the system by the Data Quality algorithms
(represented by the DQmethods) can be calculated as a function of the complexity of
the algorithms themselves. Regardless of the complexity, the algorithms will likely
process individual messages, so the latency will remain low. Even for the
Independent evaluation approach, where the algorithm typically processes entire Big Data
sources, the evaluation happens independently of the main data flow, causing
no impact on it. As a result, the latency due to the Data Quality algorithms alone can
remain low.</p>
      <p>In the cases where multiple messages are necessary to evaluate Data Quality, the
complexity of the algorithms can cause a larger impact. In such cases, it is likely that
the DataProcessors need to analyse multiple messages as well, making Parallel and
Independent (historical data) evaluations more appropriate. Even though the
Parallel evaluation approach suffers from synchronisation delays, the latency due to
synchronisation of Big Data Analysis and Data Quality results would be much lower.</p>
      <p>Latency of data transfers between modules can differ from system to system.
Generally, it can be represented as a function directly proportional to the number and
duration of data transfers between modules. In the case study, the latency introduced
by data transfers remained relatively comparable to the latency of the DQmethods,
although it drove the main cost of the latency functions. This can be exacerbated in the
cases when the modules of the platform are distributed over a network rather than located in
the same server, as in the case study. Considering that one of the main advantages
of asynchronous systems is the ability to distribute the modules over a network, it is
possible to assert that the latency of the data transfer will drive the main impact of
Data Quality in real-time Big Data Systems, since more transfers are added to enable
it. In such cases, the Embedded methods may be more appropriate, since the data
transfer factor for those methods will likely be local or non-existent.
</p>
    </sec>
    <sec id="sec-3">
      <label>6</label>
      <title>Conclusions</title>
      <p>Multiple new technologies are being used to ingest data at scales never seen in the
past. From IoT to Social Networks, these technologies are generating data in volumes,
shapes, and at paces that set special requirements. Big Data Systems need to meet
those requirements in order to ingest, process, and provide insights on that data.</p>
      <p>Data quality is one of the main challenges when creating these insights. Literature
has addressed data quality dimensions like accuracy, completeness, or Consistency
more frequently in Big Data. Nonetheless, current DQ measures are simplistic for
real-time spatiotemporal data, where recognising a DQ issue requires multiple
messages with an implied latency. Timeliness has been identified as one of the most
important dimensions of data quality when it comes to real-time Big Data systems, since
making decisions based on outdated data will ultimately lead to incorrect insights.</p>
      <p>One of the contributions of this paper is a classification of the methods available in
the literature for Big Data Quality evaluation, from the angle of the performance of real-time
systems. The identified categories include Embedded, Parallel, In-line, and
Independent methods. A second contribution is a benchmark of the different types of
methods from the classification in terms of the latency introduced in a real-time Big
Data system. The following actions have been conducted to enable this comparison.
First, four formulas have been designed to quantify the latency of the four categories.
Second, a real-time Big Data System has been deployed to serve as a framework.
Third, one method representing each category has been implemented in the
aforementioned framework. Fourth, the methods have been executed using data from a bus
catching prediction application in order to measure the latency. Finally, the results
have been analysed and the categories have been compared.</p>
      <p>
        Results suggest that the impact of Data Quality Evaluations differs depending on
the category of the method used. The main factors driving the impact in terms of
latency are the data transfers between the Data Quality and the Data Analytics algorithms,
the synchronisation of messages, and the complexity of the Data Quality algorithms.
The nature of the Big Data system will strongly influence the design of the Big Data
Quality Evaluation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [31], [32]. Each Big Data system has different characteristics
(i.e., V’s) as well as different priorities or needs (e.g., analysis performance in terms
of latency or amount of processed data). For instance, Big Data Analysis using
fastpaced data (i.e., Velocity is high), the Data Quality evaluation will get outdated as
new data is analysed in the system. These aspects constraint how the data quality
measures can be executed.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Caragliu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Bo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Nijkamp</surname>
          </string-name>
          , 'Smart Cities in Europe',
          <source>Journal of Urban Technology</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>82</lpage>
          , Apr.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bermudez-Edo</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tönjes</surname>
          </string-name>
          , '
          <article-title>Challenges for Quality of Data in Smart Cities'</article-title>
          ,
          <source>J. Data and Information Quality</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>2-3</issue>
          , pp.
          <fpage>6:1</fpage>
          -
          <lpage>6:4</lpage>
          , Jun.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kugler</surname>
          </string-name>
          et al.,
          <article-title>'Time related quality dimensions of urban remotely sensed Big Data'</article-title>
          , in
          <source>International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences</source>
          , Delft, The Netherlands, Sep.
          <year>2018</year>
          , vol.
          <volume>XLII-4</volume>
          , pp.
          <fpage>315</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghasemaghaei</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Calic</surname>
          </string-name>
          , '
          <article-title>Can big data improve firm decision quality? The role of data quality and data diagnosticity'</article-title>
          ,
          <source>DSS</source>
          , vol.
          <volume>120</volume>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>49</lpage>
          , May
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>van der Voort</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Wahyudi</surname>
          </string-name>
          , '
          <article-title>Factors influencing big data decision-making quality'</article-title>
          ,
          <source>J. of Business Research</source>
          , vol.
          <volume>70</volume>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>345</lpage>
          , Jan.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Strong</surname>
          </string-name>
          , '
          <article-title>Beyond Accuracy: What Data Quality Means to Data Consumers'</article-title>
          ,
          <source>J. Manage. Inf. Syst.</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          , Mar.
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Boselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cesarini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          , '
          <article-title>A model-based evaluation of data quality activities in KDD'</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          , vol.
          <volume>51</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>144</fpage>
          -
          <lpage>166</lpage>
          , Mar.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>El Kadiri</surname>
          </string-name>
          et al., '
          <article-title>Current trends on ICT technologies for enterprise information systems'</article-title>
          ,
          <source>Computers in Industry</source>
          , vol.
          <volume>79</volume>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>33</lpage>
          , Jun.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Mahdavinejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezvan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barekatain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Adibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnaghi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          , '
          <article-title>Machine learning for internet of things data analysis: a survey'</article-title>
          ,
          <source>Digital Communications and Networks</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>175</lpage>
          , Aug.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sotres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Santana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lanza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          , '
          <article-title>Practical Lessons From the Deployment and Management of a Smart City Internet-of-Things Infrastructure: The SmartSantander Testbed Case'</article-title>
          ,
          <source>IEEE Access</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>14309</fpage>
          -
          <lpage>14322</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Baqa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. B.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Crespi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Gall</surname>
          </string-name>
          , '
          <article-title>Quality of Information as an indicator of Trust in the Internet of Things'</article-title>
          , in
          <source>17th IEEE TrustCom/12th IEEE BigDataSE</source>
          , Aug.
          <year>2018</year>
          , pp.
          <fpage>204</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Taleb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Serhani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Dssouli</surname>
          </string-name>
          , '
          <article-title>Big Data Quality: A Data Quality Profiling Model'</article-title>
          ,
          in
          <source>Services 2019</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Bellini</surname>
          </string-name>
          , '
          <article-title>Towards Configurable Composite Data Quality Assessment'</article-title>
          , in
          <source>21st IEEE Conf. on Business Informatics</source>
          , Jul.
          <year>2019</year>
          , vol.
          <volume>01</volume>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Immonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pääkkönen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Ovaska</surname>
          </string-name>
          , '
          <article-title>Evaluating the Quality of Social Media Data in Big Data Architecture'</article-title>
          ,
          <source>IEEE Access</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>2028</fpage>
          -
          <lpage>2043</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          , '
          <article-title>On the Meaningfulness of “Big Data Quality” (Invited Paper)'</article-title>
          ,
          <source>Data Sci. Eng</source>
          ., vol.
          <volume>1</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>6</fpage>
          -
          <lpage>20</lpage>
          , Mar.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Baldassarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Caballero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Caivano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rivas Garcia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Piattini</surname>
          </string-name>
          , '
          <article-title>From Big Data to Smart Data: A Data Quality Perspective'</article-title>
          ,
          <source>in 1st ACM SIGSOFT Int. Workshop on Ensemble-Based Software Engineering</source>
          , New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>King</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>McMullen</surname>
          </string-name>
          , '
          <article-title>Big data, big data quality problem'</article-title>
          ,
          <source>in IEEE Int. Conf. on Big Data</source>
          , Santa Clara, CA, USA, Oct.
          <year>2015</year>
          , pp.
          <fpage>2644</fpage>
          -
          <lpage>2653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          'Intelligent City Platform. Smart Cambridge',
          <source>Intelligent City Platform. Smart Cambridge</source>
          . https://www.connectingcambridgeshire.co.uk/smart-places/smartcambridge/data-intelligent-city-platform-icp/ (accessed Oct. 01,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          'Big Data',
          <source>Gartner</source>
          ,
          <year>2020</year>
          . https://www.gartner.com/en/information-technology/glossary/big-data (accessed Feb. 06,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Big Data Governance: An emerging imperative, 342 vols</article-title>
          . MC Press,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Debattista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scerri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , '
          <article-title>Linked “Big” Data: Towards a Manifold Increase in Big Data Value and Veracity'</article-title>
          , in
          <source>IEEE/ACM 2nd Int. Symp. on Big Data Computing</source>
          , Dec.
          <year>2015</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Underwood</surname>
          </string-name>
          , '
          <article-title>NIST Big Data Interoperability Framework: volume 6, Reference Architecture version 3'</article-title>
          , National Institute of Standards and Technology, Gaithersburg, MD,
          <source>NIST SP 1500-6r2</source>
          , Oct.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , '
          <article-title>Big data, big risks'</article-title>
          ,
          <source>ISJ</source>
          , vol.
          <volume>26</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          , Jan.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramaswamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lawson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Gogineni</surname>
          </string-name>
          , '
          <article-title>Towards a Quality-centric Big Data Architecture for Federated Sensor Services'</article-title>
          ,
          in
          <source>IEEE Int. Cong. on Big Data</source>
          , Jun.
          <year>2013</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neutatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Visengeriyeva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          , '
          <article-title>Towards Automated Data Cleaning Workflows'</article-title>
          , in
          <source>Conf. on 'Lernen, Wissen, Daten, Analysen'</source>
          , Berlin, Germany,
          <year>2019</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          et al.,
          <article-title>'An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment'</article-title>
          ,
          <source>J. Signal Processing Systems</source>
          , vol.
          <volume>86</volume>
          , no.
          <issue>2-3</issue>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>236</lpage>
          , Mar.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Farooqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Khattak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Imran</surname>
          </string-name>
          , '
          <article-title>Data Quality Techniques in the Internet of Things: Random Forest Regression'</article-title>
          ,
          <source>in 14th Int. Conf. on Emerging Technologies</source>
          , Nov.
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Aghabozorgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Seyed</given-names>
            <surname>Shirkhorshidi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ying Wah</surname>
          </string-name>
          , '
          <article-title>Time-series clustering - A decade review'</article-title>
          ,
          <source>Information Systems</source>
          , vol.
          <volume>53</volume>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>38</lpage>
          , Oct.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>M. ter Hofstede</surname>
          </string-name>
          , '
          <article-title>Quality-informed semi-automated event log generation for process mining'</article-title>
          ,
          <source>DSS</source>
          , p.
          <fpage>113265</fpage>
          , Feb.
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Ehrlinger</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Wöß</surname>
          </string-name>
          , '
          <article-title>Automated Schema Quality Measurement in Large-Scale Information Systems'</article-title>
          , in
          <source>Data Quality and Trust in Big Data</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Geisler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          , '
          <article-title>Ontology-Based Data Quality Management for Data Streams'</article-title>
          ,
          <source>JDIQ</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>18:1</fpage>
          -
          <lpage>18:34</lpage>
          , Oct.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , '
          <article-title>CarStream: an industrial system of big data processing for internet-of-vehicles'</article-title>
          ,
          <source>Proc. VLDB Endowment</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>1766</fpage>
          -
          <lpage>1777</lpage>
          , Aug.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wiljes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          , '
          <article-title>Towards assured data quality and validation by data certification'</article-title>
          ,
          <source>in 1st Workshop on Linked Data Quality in 10th International Conference on Semantic Systems</source>
          , Leipzig, Germany, Sep.
          <year>2014</year>
          , vol.
          <volume>1215</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          'SmartCambridge'. https://smartcambridge.org/ (accessed Jun. 03,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <article-title>SmartCambridge/tfc_server</article-title>
          . Smart Cambridge,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          'Eclipse Vert.x'. https://vertx.io/ (accessed Jun. 03,
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>