<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Quality assessment and enhancement on Social and Sensor Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel R. Caldas de Aquino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Miceli de Farias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade Federal do Rio de Janeiro - Programa de Pós-Graduação em Informática UFRJ</institution>
          ,
          <addr-line>Rio de Janeiro</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Smartphones are key devices in the Internet of Things paradigm. Social networking services on the Internet can use smartphone applications as data providers. The data gathered from sensors and the data harvested from social networking services can be used by different applications to provide context-aware services. However, the excellence of these data-oriented services depends on Data Quality (DQ), which is critical for decision-making mechanisms. We present the problem related to DQ when dealing with social and sensor data. We also present and explore a framework whose objective is to evaluate and control DQ aspects of social and sensor data.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Social Network</kwd>
        <kwd>Internet of Things</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Internet of Things (IoT) paradigm embraces several types of smart devices,
which are composed of sensors, actuators and other devices with networking
capabilities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the context of IoT, smartphones are key devices embedded
with different types of sensors, and they provide large amounts of
environmental data. In addition, social networking platforms use smartphone
applications to let users interact with social networks
anywhere and at any time. This creates a pervasive channel for users to record and
share their personal activities on social platforms such as Twitter [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Huge data repositories are produced in this scenario. Services can apply data
fusion [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] techniques to sensor and social networking platform data to perform
actuations on the environment. Data analytics procedures can also use social and sensor
data in a complementary manner to enrich the analysis. This enables
data gathered from sensors and data harvested from social media
to be combined into contextual integrated services [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        However, the excellence of the aforementioned services depends on Data
Quality (DQ) aspects [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of the social and sensor data that is consumed. Quality
is deemed a critical requirement for decision-making mechanisms, applications
and services. Data with poor DQ can lead to erroneous decisions and
analyses. In this work we define DQ as a concept that refers to how well the data
corresponds to the quality needs of data consumers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An interesting
consequence follows from this definition: DQ refers to the fitness of
the data as perceived by its consumer. This means that DQ will hardly be seen
in the same way by different users. In fact, each data consumer requires the
data to fulfill certain criteria that they consider essential for the task at
hand. DQ standardizing is the process of making the data conform to certain DQ
requirements by assessing and enhancing DQ. The problem of DQ assessment is
commonly addressed through DQ dimensions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DQ enhancement can be done
through DQ enhancement techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this work we present and explore
a framework for social and sensor DQ standardizing. The rest of this work is
organized as follows: Section 2 presents the Data Quality concepts,
Section 3 presents the framework for social and sensor DQ, and Section 4
concludes the work.
      </p>
      <p>
        The authors of this research paper would like to thank Kontron for providing the
private cloud server Symkloud as the basic infrastructure used in this work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Quality Concepts</title>
      <p>
        In this section we briefly discuss some important Data Quality (DQ) concepts.
DQ is deemed a critical requirement for decision-making mechanisms,
applications and services on the IoT. There are different definitions of DQ. In this work,
DQ is defined as a concept that refers to how well the data corresponds to
the needs of data processing mechanisms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An interesting conclusion
follows from this definition: DQ refers to the fitness of the data
as perceived by its consumer. This means that DQ will hardly be seen in the
same way by different users. In fact, each data consumer requires the data
to fulfill certain criteria that they consider essential for the task at hand.
The problem of DQ assessment is commonly addressed through data quality
dimensions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The ISO international standard DQ model identifies several DQ
characteristics in the context of Software Engineering [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. According to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the
dimensions of data quality can be categorized into four semantic aspects:
(i) intrinsic, (ii) accessibility, (iii) contextual, and (iv) representational.
      </p>
      <p>
        (i) The intrinsic data quality semantic aspect is related to the quality of
the data in itself. Examples of its dimensions are accuracy,
objectivity, believability and reputation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. (ii) The accessibility data quality
semantic aspect describes how accessible the data is to data consumers.
Examples are accessibility and access security [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. (iii) The contextual data quality
semantic aspect is related to how appropriate the data is for its intended use.
Examples of its dimensions are relevancy, value-added, timeliness, completeness
and amount of data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. (iv) The representational data quality semantic aspect describes
how understandable and representative of the environment the data is. Examples
of its dimensions are interpretability, ease of understanding, concise representation
and consistent representation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. DQ challenges refer to difficulties
affecting any DQ dimension. Such challenges can cause data to become entirely or
partially unusable, since it may no longer meet the requirements of data consumers.
As discussed, DQ is not only about data accuracy: data quality
problems can extend beyond the accuracy dimension to any of the
other dimensions mentioned above.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Framework for Social and Sensor Data Quality</title>
      <p>This section presents the framework for social and sensor DQ standardizing.
Standardizing DQ means making the data conform to certain DQ requirements. The
framework receives data as input and transforms it to conform to the
DQ requirements of a given application or system, which receives the transformed data as
the output of the framework. The framework has two components: (i) the Social DQ
component and (ii) the Sensor DQ component.</p>
      <p>
        (i) The Social DQ component is responsible for standardizing social-originated
data according to DQ requirements. (ii) The Sensor DQ component is
responsible for standardizing sensor-originated data according to DQ requirements. Both
components have two subcomponents: the first
is responsible for DQ assessment, while the second is responsible
for DQ enhancement. The DQ assessment subcomponent evaluates DQ
against given DQ requirements. The DQ enhancement subcomponent
enhances DQ to meet a given DQ requirement. The framework is
illustrated in Figure 1.
      </p>
      <p>
        The Social DQ component is responsible for social DQ standardizing. The first
challenge for this component is that social-originated data differs
significantly depending on the social media platform that originated it. The
differences can concern whether the data is structured or unstructured [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], or the
type of social interaction (a message posted in a forum thread
is understood differently from a message posted in a Twitter discussion
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), among others. In other words, social data should be analyzed differently according to
the social platform that originated the message.
      </p>
      <p>These differences influence the development of Social DQ analysis solutions.
It is important to note that different social platforms may require
DQ analysis solutions specialized a priori for each of them. The Social DQ component is composed of
two subcomponents: (i) the Social DQ assessment subcomponent and (ii) the Social DQ
enhancement subcomponent.</p>
      <p>
        The Social DQ assessment subcomponent aims at assessing the DQ
of data originated from social networks. According to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Social DQ can be
degraded by (i) keyword ambiguity and (ii) user spamming. (i) Keyword
ambiguity occurs when the same context is addressed through different keywords; in
some cases the correlation between them may not even be obvious. Conversely, different
contexts can be addressed by a common keyword. Since social data is generally
collected through keyword searches, keyword ambiguity affects the overall
DQ of the collected data. (ii) User spamming is the use of trending keywords to massively
propagate a message. In some cases, such messages have no
correlation with the context in which the keyword is used; instead, users exploit
trending keywords to increase the visibility of the message. Spam
messages can lead to a misunderstanding of the keyword context. According
to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a key characteristic of spam messages is their neutral tone (e.g., "Check out
this coupon").
      </p>
      <p>The Social DQ enhancement subcomponent aims at processing social
data so that its DQ meets the DQ requirement of a data
consumer. Given the result of the Social DQ assessment subcomponent, the Social
DQ enhancement subcomponent performs one of the following actions: (i) accept the
data and deliver it to the data consumer, (ii) enhance the data to meet the
DQ requirement of the data consumer, or (iii) discard the data.</p>
      <p>
        <bold>Sensor DQ Component.</bold>
The Sensor DQ component is responsible for standardizing sensor-originated data.
However, assessing and enhancing DQ for sensor data is not trivial.
Sensing platforms are heterogeneous and resource-constrained. Sensor data can
originate from different kinds of sensor devices (e.g. smartphones, smart sensor
networks, etc. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Different sensor devices can have different data precision,
data ranges, data units, hardware specifications, etc.
      </p>
      <p>Regarding the sensed environment, sensing devices are placed in uncontrolled
environments. In such cases, misplacement, communication errors, power failure,
sensor malfunction, human error or intentional misuse can potentially degrade
sensor-originated DQ. The Sensor DQ component is composed of two
subcomponents: (i) the Sensor DQ assessment subcomponent and (ii) the Sensor DQ enhancement
subcomponent.</p>
      <p>
        The Sensor DQ assessment subcomponent has the objective of
assessing DQ for data collected by sensors. Much of the sensor DQ assessment can be
performed according to the dimensions
discussed in Section 2. In particular, believability (comparison
with the correct operating bounds), completeness (missing values), free-of-error
(erroneous values), consistency (over time), timeliness (delay), accuracy
(deviation from the true value) and precision (granularity of readings) are all important
aspects of high-quality sensor data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The sensor data must faithfully
represent the events that originated it. This implies that the data collected
by multiple sensors should be processed through data fusion techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] while
maintaining data consistency with respect to the underlying phenomenon
that originated the data.
      </p>
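      <p>Three of the dimensions listed above (completeness, believability and timeliness) can be turned into simple ratios over a batch of readings. The sketch below is one possible formulation, not the metric definitions of the cited works: the <monospace>Reading</monospace> record, its field names, the operating bounds and the delay limit are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    """Hypothetical sensor sample used only for this illustration."""
    value: Optional[float]   # None marks a missing value
    sensed_at: float         # time the sample was taken (seconds)
    received_at: float       # time the sample reached the framework (seconds)


def completeness(readings: List[Reading]) -> float:
    """Fraction of readings that carry a value."""
    if not readings:
        return 0.0
    return sum(r.value is not None for r in readings) / len(readings)


def believability(readings: List[Reading], low: float, high: float) -> float:
    """Fraction of non-missing readings inside the operating bounds [low, high]."""
    present = [r.value for r in readings if r.value is not None]
    if not present:
        return 0.0
    return sum(low <= v <= high for v in present) / len(present)


def timeliness(readings: List[Reading], max_delay: float) -> float:
    """Fraction of readings delivered within max_delay seconds of being sensed."""
    if not readings:
        return 0.0
    on_time = sum((r.received_at - r.sensed_at) <= max_delay for r in readings)
    return on_time / len(readings)
```
]]></preformat>
      <p>Each function returns a value between 0 and 1, so the results can be compared directly against per-dimension requirements of a data consumer.</p>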
      <p>
        The Sensor DQ enhancement subcomponent aims at enhancing
sensor-originated DQ. Since sensor data is constantly being integrated by
data fusion procedures, it is important to perform DQ enhancement on the fly, as
the data is being collected and processed. On-the-fly data processing prevents
erroneous and low-quality data from propagating through fusion procedures. In addition,
applying well-defined DQ enhancement procedures after data fusion can
avoid the production of low-quality data. According to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], there are five major
DQ enhancement techniques: outlier detection, interpolation, data integration,
data deduplication and data cleaning.
      </p>
      <p>(i) Outlier detection helps to improve the overall quality of datasets by
making them more consistent. It is concerned with
handling the unreliable instances of a dataset. Metrics used in outlier detection
techniques focus on accentuating the differences between data values in order to
identify outliers. (ii) Interpolation consists of inferring missing values from
other (available) values. Missing values represent gaps in the available data about
a certain entity or phenomenon of interest to the user. Since knowledge-deriving
processes use these datasets as input, such gaps can lead to incomplete
knowledge or wrong decisions, which means that missing values
decrease DQ. (iii) Data integration is important because social and sensor
data come from different sensing platforms and different environments. In order
to be used, these data need to overcome their structural differences and
inconsistencies to become truly beneficial for the various services. Data integration
solutions mainly focus on resolving the inconsistencies between the various data
streams. (iv) Data deduplication is a data compression mechanism that aims to
reduce the resources consumed by data handling: it reduces the amount of
available data by removing duplicate data items and replacing them with a
pointer to the unique remaining copy. Put simply, data deduplication is the
removal of redundant data items. (v) Data cleaning is a process composed
of three main phases: (a) determination of error types, (b) identification of potential
errors, and (c) correction of the identified (potential) errors.</p>
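      <p>Three of these techniques are easy to illustrate on a small series of values. The sketch below makes several assumptions that do not come from the surveyed work: linear interpolation stands in for interpolation in general, the median/MAD rule (modified z-score with the conventional 0.6745 constant and 3.5 cutoff) is one common outlier test among many, and deduplication simply drops exact repeats rather than replacing them with pointers. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import median


def interpolate_missing(values):
    """Fill interior None gaps by linear interpolation between known neighbours.

    Leading or trailing gaps are left as None, since they have only one neighbour.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds threshold (median/MAD rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median: the rule cannot rank deviations
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def deduplicate(records):
    """Drop exact duplicate records (must be hashable), keeping first occurrences."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique
```
]]></preformat>
      <p>When run on the fly, such steps would typically be applied in sequence to each incoming window of readings before the data reaches the fusion procedures.</p>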
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Works</title>
      <p>In this work we presented the problem related to DQ when dealing with social
and sensor data standardization. We defined DQ standardizing
as making the data conform to certain DQ requirements. We also presented and
explored a framework for social and sensor DQ standardizing, whose primary objective
is to standardize social and sensor DQ
according to the DQ requirements of an application or system. The proposed
framework has two components: the Social DQ and Sensor DQ components. The Social DQ
component is responsible for standardizing social-originated data according to
DQ requirements, while the Sensor DQ component is responsible for standardizing
sensor-originated data according to DQ requirements. Each component is
composed of two subcomponents: DQ assessment and DQ enhancement.
The DQ assessment subcomponent evaluates DQ
against given DQ requirements, and the DQ enhancement subcomponent
enhances DQ to meet a given DQ requirement.</p>
      <p>To realize a framework for social and sensor DQ standardizing, one direction for
future work is to systematically study the DQ requirements of different types
of applications that deal with social and sensor data. This study is intended to
guide research on DQ standardizing problems. Another direction
is the development of techniques for assessing and enhancing social and sensor DQ
aspects that take into account the diversity of social and sensor data inputs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atzori</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morabito</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The internet of things: A survey</article-title>
          .
          <source>Computer networks</source>
          <volume>54</volume>
          (
          <issue>15</issue>
          ),
          <fpage>2787</fpage>
          –
          <lpage>2805</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Czernek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Social measurement depends on data quantity and quality - tech report (</article-title>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Farias</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirmez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delicato</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zomaya</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Souza</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          :
          <article-title>Multisensor data fusion in shared sensor and actuator networks</article-title>
          .
          <source>In: 17th International Conference on Information Fusion (FUSION)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards altruistic data quality assessment for mobile sensing</article-title>
          .
          <source>In: Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers</source>
          . pp.
          <fpage>464</fpage>
          –
          <lpage>469</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Karkouch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mousannif</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al Moatassime</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Data quality in internet of things: A state-of-the-art survey</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>73</volume>
          ,
          <fpage>57</fpage>
          –
          <lpage>81</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Labouseur</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matheus</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>An introduction to dynamic data quality challenges</article-title>
          .
          <source>Journal of Data and Information Quality (JDIQ)</source>
          <volume>8</volume>
          (
          <issue>2</issue>
          ),
          <elocation-id>6</elocation-id>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nakamura</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loureiro</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frery</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Information fusion for wireless sensor networks: Methods, models, and classifications</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <elocation-id>9</elocation-id>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <collab>International Organization for Standardization/International Electrotechnical Commission</collab>
          :
          <article-title>Software engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Data quality model</article-title>
          .
          <source>ISO/IEC 25012</source>
          ,
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Tarasconi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farina</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazzei</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bosca</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The role of unstructured data in real-time disaster-related social media monitoring</article-title>
          .
          <source>In: 2017 IEEE International Conference on Big Data (Big Data)</source>
          . pp.
          <fpage>3769</fpage>
          –
          <lpage>3778</lpage>
          (Dec
          <year>2017</year>
          ). https://doi.org/10.1109/BigData.2017.8258377
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strong</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Beyond accuracy: What data quality means to data consumers</article-title>
          .
          <source>Journal of Management Information Systems</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>5</fpage>
          –
          <lpage>33</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Yerva</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeung</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aberer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Cloud based social and sensor data fusion</article-title>
          .
          <source>In: 2012 15th International Conference on Information Fusion</source>
          . pp.
          <fpage>2494</fpage>
          –
          <lpage>2501</lpage>
          (July
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>