=Paper= {{Paper |id=Vol-1746/paper-21 |storemode=property |title=Data Science in Sensing Machine Generated Data |pdfUrl=https://ceur-ws.org/Vol-1746/paper-21.pdf |volume=Vol-1746 |authors=Ana Ktona,Inva Bilo,Denada Xhaja,Xheni Melo |dblpUrl=https://dblp.org/rec/conf/rtacsit/KtonaBXM16 }} ==Data Science in Sensing Machine Generated Data== https://ceur-ws.org/Vol-1746/paper-21.pdf
                       Data science in sensing machine generated data



      Ana (Resulaj) Ktona                             Inva Bilo                           Denada Xhaja
      University of Tirana                    University of Gjirokastra                 University of Tirana
     ana.ktona@fshn.edu.al                       ibilo@uogj.edu.al                   denada.xhaja@fshn.edu.al

                                                     Xheni Melo
                                                 University of Tirana
                                               xheni.melo@fshn.edu.al


Abstract

    Recent advances in hardware for mobile devices and sensor processing
    have resulted in greater availability of sensor-generated data. For
    example, mobile devices contain many sensors, such as GPS,
    accelerometers, gyroscopes, magnetometers, and thermometers, which
    produce large volumes of data over time. This has led to a need for
    principled methods for efficient processing of sensor-generated data.
    In this paper, we describe the application of data mining techniques
    in a case study of identifying patterns in m-health sensor-generated
    data. These techniques will be used to build a model for outlier
    analysis, pattern analysis, and prediction analysis.

    Keywords: sensor, Internet of Things, big data, data processing.

1. Introduction

In recent years humans have had many more interactions with things,
because modern devices contain more sensors than ever [ThKS14]. The
addition of these sensors to everyday devices has become particularly
apparent when reviewing the rate of global sales, because such devices are
not confined to developed economies.
   Sensors embedded in the devices we use help us monitor almost every
area of our lives through applications in healthcare, the economy,
telecommunications, etc. It is also important to note that the cost of
sensors has fallen considerably in recent years, which has made collecting
data easier. Nowadays people use these devices in their daily activities;
for most of them this has even become an inevitable routine. With
increasing awareness of physical and mental health, it has become much
easier to monitor many health parameters through sensors embedded in
smartphones or other related devices. What remains is to process all this
generated information and extract valuable information from it. In this
paper, we describe the application of data mining techniques in a case
study of identifying patterns in m-health sensor-generated data. These
techniques will be used to build a model for outlier analysis, pattern
analysis, and prediction analysis.

2. The Impact of the Internet of Things on Big Data

Big data existed before the Internet of Things (IoT), and the IoT is not
the only source of big data. But what is the impact of IoT on big data? It
is seen first in the storage of data: the Internet of Things and cloud
storage make it easier to store the large amounts of data that flow into
companies every day. IoT is also a source of data generation. Connected
devices and sensors are responsible for collecting data, and that data
joins other data to grow the amount of big data available to companies.
Every day, sensors embedded in connected devices gather data and transmit
it to central servers, which assist companies in making decisions. The
Internet of Things has been a major influence on the big data landscape.
Now that millions of devices are connected and generate enormous volumes
of data, the efficiency of data collection mechanisms should be considered.
   First, companies need to adopt highly efficient data collection
mechanisms. Second, companies are facing
many security issues which are probably not
addressed by traditional ones. Third, not all data generated by these
devices is useful. Last, IoT big data is changing our everyday lives at a
fundamental level.

3. Sensor Data Mining and Processing

The enormous volume of data produced and transmitted by sensing devices is
considered a big data challenge. Sensor-generated data poses great
challenges, especially in the processing phase, because real-time
processing of a large volume of uncertain data is very often needed.
Sensor data analytics is a growing field that aims to deal with this.
   The large volumes of sensor data necessitate the design of efficient
algorithms that require at most one scan of the data (known as data stream
mining algorithms). A main characteristic of IoT (sensor-generated) data is
its distributed storage, which makes data mining a challenging task. The
quantity and quality of such data do not keep pace with each other: there
is a large quantity of data, but of low quality, coming from heterogeneous
sources. We have to deal with this variety and noise in the data, which
make it difficult to find and correct errors. Data mining algorithms
therefore need to be modified to suit big data.
   It is much easier to create data than to analyze it. Data mining
methods such as clustering, classification, frequent pattern mining, and
outlier detection are often applied to sensor-generated data in order to
extract patterns from it [TLCY14]. This data usually needs to be filtered
for more effective analysis. The challenge is that traditional mining
algorithms are often not designed for real-time processing. Therefore, new
algorithms for processing sensor-generated data need to perform the
analytics in real time in order to make IoT more intelligent and thus
provide smarter services.

4. Pattern Analysis

This paper focuses on the data generated by sensors in the mobile health
(m-health) field. Such sensor devices are connected to a server and are a
source of big data. Patterns then need to be extracted from this data
through different mining techniques [BS15, MPPFM10, DR13, RKDT14, BYX10,
MS10]. Many international companies have developed applications for
tracking physical activity throughout the world, among them Google Fitbit,
Apple HealthKit, Samsung SAMI, etc. The difficulty lies in the
availability of such data, since most of it is confidential and therefore
cannot be made publicly available. In this paper we have analyzed two
realistic datasets: (1) the Heterogeneity Dataset for Human Activity
Recognition [SBBPKDSJ15]; and (2) the PAMAP2 dataset (Physical Activity
Monitoring) [SS12, RS12]. Both are publicly available to the research
community. The first dataset was devised to benchmark human activity
recognition algorithms (classification, automatic data segmentation,
feature extraction, etc.) on data from heterogeneous sensors, while the
PAMAP2 dataset provides a good basis for developing and evaluating data
processing and classification techniques for physical activity monitoring.
The data mining algorithms applied to the two datasets are J48 (C4.5) and
Naive Bayes. Each dataset is divided into two parts: a training set and a
testing set. Applying these data mining algorithms shows how the
sensor-generated data are classified and the errors they produce.

4.1. Heterogeneity Dataset for Human Activity Recognition

The dataset contains the readings of two motion sensors (gyroscope and
accelerometer) commonly found in smartphones, recorded while nine users
carrying smartwatches and smartphones executed scripted activities in no
specific order. The activities performed by users are: biking, sitting,
standing, walking, stair up and stair down. The dataset contains 10
attributes together with the activity performed by users (which is what we
are going to predict). The attributes taken into account for analysis are
arrival time, creation time, and the X, Y and Z axes.

4.2. PAMAP2 dataset

The PAMAP2 dataset contains data from 18 activities (everyday household
and sport activities) performed by 9 subjects wearing three IMUs (inertial
measurement units) and a heart rate (HR) monitor. The dataset contains 54
attributes; the one we are going to predict is the activity performed by
the user, based on the other parameters. The activities are: lying,
sitting, standing, walking, running, cycling, Nordic walking, watching TV,
computer work, car driving, ascending stairs, descending stairs, vacuum
cleaning, ironing, folding laundry, house cleaning, playing soccer, rope
jumping.

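The protocol outlined in Section 4, in which part of each dataset is held
out for testing and a classifier is scored on how many instances it labels
correctly and on its error, can be sketched as follows. This is a minimal
illustration under stated assumptions, not the setup used in this study.
The paper does not name a toolkit, so a small pure-Python Gaussian Naive
Bayes stands in for the Naive Bayes classifier (J48/C4.5 is omitted for
brevity). The three-axis readings are synthetic and the activity centres
are invented. RMSE is assumed to be computed over predicted
class-probability vectors, which is one common convention. All function
and class names (synthetic_rows, percentage_split, GaussianNaiveBayes,
evaluate) are hypothetical.

```python
import math
import random
from collections import defaultdict


def synthetic_rows(n_per_class=200, seed=1):
    """Invented three-axis readings: one Gaussian cluster per activity."""
    rng = random.Random(seed)
    centres = {  # hypothetical (x, y, z) centres, not taken from any dataset
        "sitting": (0.0, 0.0, 9.8),
        "walking": (1.5, 0.5, 9.0),
        "biking": (3.0, 2.0, 8.0),
    }
    return [
        ([rng.gauss(c, 0.3) for c in centre], label)
        for label, centre in centres.items()
        for _ in range(n_per_class)
    ]


def percentage_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (class, feature) pair."""

    def fit(self, rows):
        by_class = defaultdict(list)
        for features, label in rows:
            by_class[label].append(features)
        self.classes = sorted(by_class)
        self.priors, self.stats = {}, {}
        for label, samples in by_class.items():
            self.priors[label] = len(samples) / len(rows)
            self.stats[label] = []
            for column in zip(*samples):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # Small constant guards against zero variance.
                self.stats[label].append((mean, var + 1e-9))
        return self

    def predict_proba(self, features):
        """Posterior probabilities, in the order of self.classes."""
        log_scores = []
        for label in self.classes:
            score = math.log(self.priors[label])
            for value, (mean, var) in zip(features, self.stats[label]):
                score += (-0.5 * math.log(2 * math.pi * var)
                          - (value - mean) ** 2 / (2 * var))
            log_scores.append(score)
        top = max(log_scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in log_scores]
        total = sum(weights)
        return [w / total for w in weights]


def evaluate(model, rows):
    """Return (CCI in percent, RMSE over class-probability vectors)."""
    correct, squared_error = 0, 0.0
    for features, label in rows:
        probs = model.predict_proba(features)
        predicted = model.classes[probs.index(max(probs))]
        correct += predicted == label
        # Assumed convention: squared error against a 0/1 indicator,
        # averaged over the classes.
        squared_error += sum(
            (p - (1.0 if c == label else 0.0)) ** 2
            for p, c in zip(probs, model.classes)
        ) / len(model.classes)
    return 100.0 * correct / len(rows), math.sqrt(squared_error / len(rows))


rows = synthetic_rows()
train, test = percentage_split(rows, train_fraction=0.7)
model = GaussianNaiveBayes().fit(train)
cci, rmse = evaluate(model, test)
print(f"CCI = {cci:.2f}%  RMSE = {rmse:.4f}")
```

Calling evaluate(model, train) instead of evaluate(model, test) gives the
corresponding figures measured on the training data itself. To try the
sketch on real readings, synthetic_rows would be replaced by a loader that
yields (feature list, activity label) pairs.
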
4.3. Data Mining Process

The data mining process consisted of the following steps:
1-Detection of adequate data: This involved finding the sensor-generated
data that may be relevant to this study for the data mining process.
2-Data Pre-Processing: The collected data was transformed so that it could
be processed by the data mining algorithms: some unnecessary columns and
rows were removed from the data set according to data mining best
practices. This resulted in a data set of 938086 rows with 6 columns for
the Heterogeneity Dataset for Human Activity Recognition, and 249957 rows
with 53 columns for the PAMAP2 dataset.
3-Definition of Training Set: Classifiers were trained independently for
the two datasets.
4-Algorithms Selection: The selected algorithms are J48 (C4.5) and Naive
Bayes.
5-Training: Classifiers were produced by training the J48 and Naive Bayes
data mining algorithms on the historical sensor data.
6-Evaluation: Classifiers were evaluated both on the training set and with
a percentage split into training and test sets. The performance metrics
produced for each classifier include Correctly Classified Instances (CCI)
and Root Mean Squared Error (RMSE).

5. Results and discussions

Classifiers were trained independently for the two datasets and evaluated
both on the training data and with a percentage split into training and
test sets. The tables below give the performance metrics (CCI - Correctly
Classified Instances, RMSE - Root Mean Squared Error) of the applied
algorithms for each dataset.

  Table 1: Evaluation of Heterogeneity Dataset for Human Activity Recognition

                                   J48                Naive Bayes
                            CCI (%)    RMSE        CCI (%)    RMSE
  Using training data        84.22     0.1905       66.1      0.2689
  Using percentage split     77.15     0.276        66.30     0.2701
  (% of training set)        (70%)                  (70%)

  Table 2: Evaluation of PAMAP2 Dataset

                                   J48                Naive Bayes
                            CCI (%)    RMSE        CCI (%)    RMSE
  Using training data        99.99     0.0033       96.84     0.0483
  Using percentage split     99.93     0.0079       96.88     0.0479
  (% of training set)        (70%)                  (66%)

Experiments indicated that the J48 (C4.5) algorithm produces better results
than the Naive Bayes algorithm on both datasets, and that the
sensor-generated data in the PAMAP2 dataset is better suited to physical
activity recognition techniques.

6. Conclusions

Nowadays people use sensor devices in their daily activities, and the
enormous volume of data produced and transmitted by these devices is
considered a big data challenge. Applying data mining techniques and the
Internet of Things in healthcare can help people stay in good health
without requiring the presence of doctors, and can motivate changes in
physical activity behavior.
   In this paper, we presented two case studies on applying data mining
algorithms to historical sensor data (realistic data) in order to evaluate
data processing and classification techniques for human physical activity
monitoring.

References

[ThKS14] B. Thirunavukarasu, T. Kalaikumaran, S. Karthik. Integration of
Data Mining and Internet of Things – Improved Athlete Performance And
Health Care System. International Journal of Technical Research and
Applications, e-ISSN: 2320-8163, Special Issue 11, 28-31, Nov-Dec 2014.

[TLCY14] Ch.W. Tsai, Ch.F. Lai, M.Ch. Chiang, L.T. Yang. Data Mining for
Internet of Things: A Survey. IEEE Communications Surveys & Tutorials,
Vol. 16, No. 1, First Quarter 2014.

[BS15] Sh. Bhatia, S. Patel. Analysis on different Data mining Techniques
and algorithms used in IOT. Int. Journal of Engineering Research and
Applications, ISSN: 2248-9622, Vol. 5, Issue 11, (Part - 1), 82-85,
November 2015.

[MPPFM10] Alexandra Moraru, Marko Pesko, Maria Porcius, Carolina Fortuna,
Dunja Mladenic. Using Machine Learning on Sensor Data. Journal of
Computing and Information Technology - CIT 341–347, 18, 2010, 4.

[DR13] Ch. Dule, K.M. Rajasekharaiah. Page Sensor Data Mining Model and
System Design: A Review. International Refereed Journal of Engineering and
Science (IRJES), ISSN (Online) 2319-183X, (Print) 2319-1821, Volume 2,
Issue 6, 16-22, June 2013.

[RKDT14] A. Rook, A. Knauss, D. Damian, A. Thomo. A Case Study of Applying
Data Mining to Sensor Data for Contextual Requirements Analysis.
978-1-4799-6355-3/14, IEEE AIRE 2014, Karlskrona, Sweden, 2014.

[BYX10] Sh. Bin, L. Yuan, W. Xiaoyi. Research on Data Mining Models for the
Internet of Things. IEEE 978-1-4244-5555-3/10.

[MS10] A. Mannini, A.M. Sabatini. Machine Learning Methods for Classifying
Human Physical Activity from On-Body Accelerometers. Sensors 2010, 10,
1154-1175; doi:10.3390/s100201154.

[SBBPKDSJ15] A. Stisen, H. Blunck, S. Bhattacharya, Th.S. Prentow,
M.B. Kjærgaard, A. Dey, T. Sonne, M.M. Jensen. Smart Devices are Different:
Assessing and Mitigating Mobile Sensing Heterogeneities for Activity
Recognition. In Proc. 13th ACM Conference on Embedded Networked Sensor
Systems (SenSys 2015), Seoul, Korea, 2015.

[SS12] R. Stricker, D. Stricker. Introducing a New Benchmarked Dataset for
Activity Monitoring. The 16th IEEE International Symposium on Wearable
Computers (ISWC), 2012.

[RS12] A. Reiss, D. Stricker. Creating and Benchmarking a New Dataset for
Physical Activity Monitoring. The 5th Workshop on Affect and Behaviour
Related Assistance (ABRA), 2012.