<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data science in sensing machine generated data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana (Resulaj) Ktona</string-name>
          <email>ana.ktona@fshn.edu.al</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inva Bilo</string-name>
          <email>ibilo@uogj.edu.al</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xheni Melo</string-name>
          <email>xheni.melo@fshn.edu.al</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denada Xhaja</string-name>
          <email>denada.xhaja@fshn.edu.al</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Gjirokastra</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Tirana</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>Recent advances in hardware for mobile devices and sensor processing have resulted in greater availability of sensor-generated data. For example, mobile devices contain many sensors, such as GPS receivers, accelerometers, gyroscopes, magnetometers, and thermometers, which produce large volumes of data over time. This has led to a need for principled methods for processing sensor-generated data efficiently. In this paper, we describe the application of data mining techniques in a case study of identifying patterns in m-health sensor-generated data. These techniques will be used to build a model for outlier analysis, pattern analysis, and prediction analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>sensor</kwd>
        <kwd>Internet of Things</kwd>
        <kwd>big data</kwd>
        <kwd>data processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, people have come to interact far more with
things, because modern devices contain more sensors than ever
[ThKS14]. The addition of these sensors to everyday devices
becomes particularly apparent when reviewing global sales,
because such devices are not confined to developed economies.</p>
      <p>Sensors embedded in the devices we use help us monitor
almost every area of our lives through applications in fields
such as healthcare, the economy, and telecommunications. The cost
of sensors has also fallen considerably in recent years, which has
made collecting data easier. People now use these devices in their
daily activities, and for most of them this has become an
unavoidable routine. With growing awareness of physical and mental
health, it has become much easier to monitor many health
parameters through sensors embedded in smartphones and related
devices. What remains is to process all this generated data and to
extract valuable information from it. In this paper, we describe
the application of data mining techniques in a case study of
identifying patterns in m-health sensor-generated data. These
techniques will be used to build a model for outlier analysis,
pattern analysis, and prediction analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Impact of the Internet of Things on Big Data</title>
      <p>Big data existed before the Internet of Things (IoT), and the
IoT is not the only source of big data. What, then, is the impact
of the IoT on big data? It is seen first in data storage: the IoT
and cloud storage make it easier to store the large amounts of
data that flow into companies every day. The IoT is also a source
of data generation. Connected devices and sensors collect data,
and that data joins other data to grow the amount of big data
available to companies. Every day, sensors embedded in connected
devices gather data and transmit it to central servers, which
assists companies in making decisions. The IoT has thus been a
major influence on the big data landscape. Now that millions of
devices are connected and generate enormous volumes of data, the
efficiency of the data collection mechanism must be considered.</p>
      <p>First, companies need to employ highly efficient data
collection mechanisms. Second, companies face many security
issues that traditional security measures probably do not
address. Third, not all data generated by these devices is
useful. Last, IoT big data is changing our everyday lives at a
fundamental level.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Sensor Data Mining and Processing</title>
      <p>The enormous volume of data produced and transmitted by
sensing devices is considered a big data challenge.
Sensor-generated data poses great challenges, especially in the
processing phase, because a large volume of uncertain data very
often has to be processed in real time. Sensor data analytics is
a growing field that addresses this need.</p>
      <p>The large volumes of sensor data necessitate efficient
algorithms that require at most one scan of the data (known as
data stream mining algorithms). A main characteristic of IoT data
(sensor-generated data) is its distributed storage, which makes
data mining a challenging task. The quantity and quality of such
data do not keep pace with each other: data arrive in large
quantities but with low quality from heterogeneous sources. This
variety and noise make it difficult to find and correct errors.
Data mining algorithms therefore need to be modified to suit big
data.</p>
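      <p>As an illustration only (it is not part of the experiments
reported in this paper), the single-scan constraint can be sketched in
Python with Welford's algorithm, which maintains the mean and variance
of a stream while reading each value exactly once. The names
RunningStats and summarize are hypothetical and chosen for this
sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


class RunningStats:
    """One-pass mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Consume one reading; the reading is never stored or revisited."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two readings have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        """Sample standard deviation."""
        return math.sqrt(self.variance)


def summarize(stream):
    """Scan an iterable of numeric readings once and return its statistics."""
    stats = RunningStats()
    for reading in stream:
        stats.update(reading)
    return stats
```
]]></preformat>
      <p>Because only three numbers are kept in memory, such a summary
can be updated on the device or on the server as readings arrive,
regardless of the length of the stream.</p>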
      <p>It is much easier to create data than to analyze it.
Data mining methods such as clustering, classification, frequent
pattern mining, and outlier detection are often applied to
sensor-generated data in order to extract patterns from it
[TLCY14]. This data usually needs to be filtered for more
effective analysis. The challenge is that traditional mining
algorithms are often not designed for real-time processing. New
algorithms for processing sensor-generated data therefore need to
perform the analytics in real time in order to make the IoT more
intelligent and thus provide smarter services.</p>
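      <p>The filtering step mentioned above can take many forms. As a
minimal sketch, assuming that noise appears as isolated spikes in a
numeric reading, a sliding-window median filter smooths the stream
while it arrives. This filter is our illustrative choice, not the
preprocessing used in the case study, and the function name
median_filter and the default window size are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import deque
from statistics import median


def median_filter(stream, window=5):
    """Yield a smoothed value for each incoming reading.

    Each output is the median of the last `window` readings seen so far,
    so an isolated spike is suppressed without buffering the whole stream.
    """
    recent = deque(maxlen=window)  # the oldest reading is dropped automatically
    for reading in stream:
        recent.append(reading)
        yield median(recent)
```
]]></preformat>
      <p>For example, list(median_filter(readings, window=3)) returns one
smoothed value per input reading, so the filter can be placed in front
of a classifier without changing the number of instances.</p>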
    </sec>
    <sec id="sec-5">
      <title>4. Pattern Analysis</title>
      <p>This paper focuses on the data generated by sensors in the
mobile health (m-health) field. Such sensor devices are connected
to a server and are a source of big data. Patterns then need to be
extracted from these data through different mining techniques
[BS15, MPPFM10, DR13, RKDT14, BYX10, MS10]. Many international
companies have developed applications for tracking physical
activity, among them Google Fit, Apple HealthKit, and Samsung
SAMI. The difficulty lies in the availability of such data: most
of it is confidential and therefore cannot be made public. In
this paper we analyze two realistic datasets: (1) the
Heterogeneity Dataset for Human Activity Recognition
[SBBPKDSJ15]; and (2) the PAMAP2 (Physical Activity Monitoring)
dataset [SS12, RS12]. Both are publicly available to the research
community. The first dataset was devised to benchmark human
activity recognition algorithms (classification, automatic data
segmentation, feature extraction, etc.) on data from
heterogeneous sensors, while the PAMAP2 dataset provides a good
basis for developing and evaluating data processing and
classification techniques for physical activity monitoring. The
data mining algorithms applied to the two datasets are J48 (C4.5)
and Naive Bayes. Each dataset is divided into two parts: a
training set and a testing set. Applying these algorithms shows
how the sensor-generated data are classified and what errors each
algorithm produces.</p>
      <sec id="sec-5-1">
        <title>4.1. Heterogeneity Dataset for Human Activity Recognition</title>
        <p>The dataset contains the readings of two motion
sensors<fn id="fn1"><label>1</label><p>The sensors used to gather activity data are a gyroscope and an accelerometer.</p></fn>
commonly found in smartphones, recorded while nine users carrying
smartwatches and smartphones executed scripted activities in no
specific order. The activities performed by the users are biking,
sitting, standing, walking, stairs up, and stairs down. The
dataset contains 10 attributes, including the activity performed
by the user, which is the attribute we are going to predict. The
attributes taken into account for the analysis are arrival time,
creation time, and the X, Y, and Z axes.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.2. PAMAP2 dataset</title>
        <p>The PAMAP2 dataset contains data from 18
activities<fn id="fn2"><label>2</label><p>Everyday household and sport activities.</p></fn>
performed by 9 subjects wearing three IMUs (inertial measurement
units) and a heart rate monitor. The dataset contains 54
attributes; the one we are going to predict is the activity a
user performs, based on the other parameters. The activities are
lying, sitting, standing, walking, running, cycling, Nordic
walking, watching TV, computer work, car driving, ascending
stairs, descending stairs, vacuum cleaning, ironing, folding
laundry, house cleaning, playing soccer, and rope jumping.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Data Mining Process</title>
        <p>The data mining process consisted of the following
steps:</p>
        <list list-type="order">
          <list-item><p>Detection of adequate data: finding the sensor-generated data relevant to this study for the data mining process.</p></list-item>
          <list-item><p>Data pre-processing: the collected data was transformed so that it could be processed by the data mining algorithms. Unnecessary columns and rows were removed from the datasets according to data mining best practices. This resulted in 938086 rows with 6 columns for the Heterogeneity Dataset for Human Activity Recognition, and 249957 rows with 53 columns for the PAMAP2 dataset.</p></list-item>
          <list-item><p>Definition of the training set: classifiers were trained independently for the two datasets.</p></list-item>
          <list-item><p>Algorithm selection: the selected algorithms are J48 (C4.5) and Naive Bayes.</p></list-item>
          <list-item><p>Training: classifiers were produced by training the J48 and Naive Bayes algorithms on the historical sensor data.</p></list-item>
          <list-item><p>Evaluation: classifiers were evaluated on the training set and with a percentage split into training and test sets. The performance metrics produced for each classifier are Correctly Classified Instances (CCI) and Root Mean Squared Error (RMSE).</p></list-item>
        </list>
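        <p>To make the evaluation step concrete, the following Python
sketch shows one way to perform a percentage split and to compute the
two metrics. It is a sketch under stated assumptions, not the tooling
used in this study: the function names are hypothetical, and we assume
that RMSE for a nominal class is computed between the predicted class
probabilities and 0/1 class indicators, averaged over classes and
instances, which is one common convention.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def percentage_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into (train, test) by the fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cci_percent(actual, predicted):
    """Correctly Classified Instances as a percentage of all instances."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def rmse(actual, distributions, classes):
    """RMSE between predicted class probabilities and 0/1 class indicators.

    `distributions` holds one dict per instance mapping class -> probability.
    The squared error is averaged over classes and over instances
    (an assumed convention, see the text above).
    """
    total = 0.0
    for label, dist in zip(actual, distributions):
        for c in classes:
            target = 1.0 if c == label else 0.0
            total += (dist.get(c, 0.0) - target) ** 2
    return math.sqrt(total / (len(actual) * len(classes)))
```
]]></preformat>
        <p>Evaluating on the training data corresponds to calling
cci_percent and rmse on the same rows that were used for training,
while the percentage split calls them on the held-out part returned by
percentage_split. The first mode usually gives the more optimistic
figures.</p>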
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and discussions</title>
      <p>Classifiers were trained independently for the two datasets
and evaluated both on the training data and with a percentage split
into training and test sets. Tables 1 and 2 report the performance
metrics (CCI, Correctly Classified Instances; RMSE, Root Mean
Squared Error) of the applied algorithms, one table per dataset.
The percentage in parentheses is the share of the data used as the
training set.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption><p>CCI and RMSE of J48 and Naive Bayes under both evaluation modes.</p></caption>
        <table>
          <thead>
            <tr><th>Evaluation mode</th><th>J48 CCI (%)</th><th>J48 RMSE</th><th>Naive Bayes CCI (%)</th><th>Naive Bayes RMSE</th></tr>
          </thead>
          <tbody>
            <tr><td>Using training data</td><td>84.22</td><td>0.1905</td><td>66.1</td><td>0.2689</td></tr>
            <tr><td>Using percentage split (% of training set)</td><td>77.15 (70%)</td><td>0.276</td><td>66.30 (70%)</td><td>0.2701</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>CCI and RMSE of J48 and Naive Bayes under both evaluation modes.</p></caption>
        <table>
          <thead>
            <tr><th>Evaluation mode</th><th>J48 CCI (%)</th><th>J48 RMSE</th><th>Naive Bayes CCI (%)</th><th>Naive Bayes RMSE</th></tr>
          </thead>
          <tbody>
            <tr><td>Using training data</td><td>99.99</td><td>0.0033</td><td>96.84</td><td>0.0483</td></tr>
            <tr><td>Using percentage split (% of training set)</td><td>99.93 (70%)</td><td>0.0079</td><td>96.88 (66%)</td><td>0.0479</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Experiments indicated that the J48 (C4.5) algorithm produces
better results than Naive Bayes.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>People now use sensor devices in their daily activities, and
the enormous volume of data produced and transmitted by these
devices is considered a big data challenge. Applying data mining
techniques and the Internet of Things in healthcare can help
people maintain good health without requiring the presence of a
doctor, and can motivate changes in physical activity behavior.</p>
      <p>In this paper, we presented two case studies on applying data
mining algorithms to historical sensor data (realistic data) in
order to evaluate data processing and classification techniques
for human physical activity monitoring.</p>
      <ref-list>
        <title>References</title>
        <ref id="ref-ThKS14"><label>[ThKS14]</label><mixed-citation>B. Thirunavukarasu, T. Kalaikumaran, S. Karthik. Integration of Data Mining and Internet of Things – Improved Athlete Performance and Health Care System. International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Special Issue 11, 28-31, Nov-Dec 2014.</mixed-citation></ref>
        <ref id="ref-TLCY14"><label>[TLCY14]</label><mixed-citation>Ch. W. Tsai, Ch. F. Lai, M. Ch. Chiang, L. T. Yang. Data Mining for Internet of Things: A Survey. IEEE Communications Surveys &amp; Tutorials, Vol. 16, No. 1, First Quarter 2014.</mixed-citation></ref>
        <ref id="ref-BS15"><label>[BS15]</label><mixed-citation>Sh. Bhatia, S. Patel. Analysis on different Data mining Techniques and algorithms used in IOT. Int. Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 5, Issue 11, (Part - 1), 82-85, November 2015.</mixed-citation></ref>
        <ref id="ref-MPPFM10"><label>[MPPFM10]</label><mixed-citation>Alexandra Moraru, Marko Pesko, Maria Porcius, Carolina Fortuna, Dunja Mladenic. Using Machine Learning on Sensor Data. Journal of Computing and Information Technology - CIT 18, 2010, 4, 341–347.</mixed-citation></ref>
        <ref id="ref-DR13"><label>[DR13]</label><mixed-citation>Ch. Dule, K. M. Rajasekharaiah. Page Sensor Data Mining Model and System Design: A Review. International Refereed Journal of Engineering and Science (IRJES), ISSN (Online) 2319-183X, (Print) 2319-1821, Volume 2, Issue 6, 16-22, June 2013.</mixed-citation></ref>
        <ref id="ref-BYX10"><label>[BYX10]</label><mixed-citation>Sh. Bin, L. Yuan, W. Xiaoyi. Research on Data Mining Models for the Internet of Things. IEEE 978-1-4244-5555-3/10.</mixed-citation></ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>