=Paper=
{{Paper
|id=Vol-1746/paper-21
|storemode=property
|title=Data Science in Sensing Machine Generated Data
|pdfUrl=https://ceur-ws.org/Vol-1746/paper-21.pdf
|volume=Vol-1746
|authors=Ana Ktona,Inva Bilo,Denada Xhaja,Xheni Melo
|dblpUrl=https://dblp.org/rec/conf/rtacsit/KtonaBXM16
}}
==Data Science in Sensing Machine Generated Data==
Ana (Resulaj) Ktona, University of Tirana, ana.ktona@fshn.edu.al
Inva Bilo, University of Gjirokastra, ibilo@uogj.edu.al
Denada Xhaja, University of Tirana, denada.xhaja@fshn.edu.al
Xheni Melo, University of Tirana, xheni.melo@fshn.edu.al

Abstract

Recent advances in hardware for mobile technology and sensor processing have resulted in greater availability of sensor generated data. For example, mobile devices contain many sensors, such as GPS, accelerometers, gyroscopes, magnetometers and thermometers, which produce large volumes of data over time. This has led to a need for principled methods for efficient processing of sensor generated data. In this paper, we describe the application of data mining techniques in a case study of identifying patterns in m-health sensor generated data. These techniques will be used to build a model for outlier analysis, pattern analysis, and prediction analysis.

Keywords: sensor, Internet of Things, big data, data processing.

1. Introduction

In recent years people have come to interact far more with things, because modern devices contain more sensors than ever [ThKS14]. The addition of these sensors to everyday devices has become particularly apparent when reviewing the rate of global sales, because such devices are not confined to developed economies.
Sensors embedded in the devices we use help us monitor almost every area of our lives through applications in fields such as healthcare, the economy and telecommunications. It is also important to note that the cost of sensors has fallen considerably in recent years, which has made collecting data easier. Nowadays people use these devices in their daily activities, and for most of them this has become an inevitable routine. Since there is increasing awareness of physical and mental health, it has become much easier to monitor many health parameters through sensors embedded in smartphones or other related devices. What remains is to process all this generated data and extract valuable information from it. In this paper, we describe the application of data mining techniques in a case study of identifying patterns in m-health sensor generated data. These techniques will be used to build a model for outlier analysis, pattern analysis, and prediction analysis.

2. The Impact of the Internet of Things on Big Data

Big data existed before the Internet of Things, and the Internet of Things is not the only source of big data. But what is the impact of IoT on big data? It is seen first in the storage of the data. The Internet of Things and cloud storage make it easier to store the large amounts of data that flow into companies every day. IoT is also a source of data generation. The connected devices and sensors are responsible for collecting data, and that data joins other data to grow the amount of big data available to companies. Every day, sensors embedded in connected devices gather data and transmit it to central servers, which assists companies in making decisions.
The Internet of Things (IoT) has been a major influence on the Big Data landscape. Now that millions of devices are connected and generate enormous volumes of data, the efficiency of the data collection mechanism has to be considered.
First, companies need to employ highly efficient data collection mechanisms. Second, companies are facing many security issues which are probably not
addressed by traditional measures. Third, not all data generated by these devices is useful. Last, IoT Big Data is changing our everyday lives at a fundamental level.

3. Sensor Data Mining and Processing

The enormous volume of data produced and transmitted by sensing devices is considered a big data challenge. Sensor generated data poses great challenges, especially in the processing phase, because real-time processing of a large volume of uncertain data is very often needed. To deal with that, sensor data analytics is a growing field.
The large volumes of sensor data necessitate the design of efficient algorithms which require at most one scan of the data (known as data stream mining algorithms). A main characteristic of IoT data (sensor generated data) is its distributed storage, which makes data mining a challenging task. The quantity and quality of such data do not keep pace with each other: a large quantity of data arrives from heterogeneous sources, but its quality is low. We have to deal with this variety and noise in the data, which make it difficult to find and correct errors. Data mining algorithms need to be modified to suit big data.
It is much easier to create data than to analyze it. Data mining methods such as clustering, classification, frequent pattern mining, and outlier detection are often applied to sensor generated data in order to extract patterns from it [TLCY14]. This data usually needs to be filtered for more effective analysis. The challenge is that traditional mining algorithms are often not designed for real-time processing. Therefore, new algorithms for sensor generated data processing need to perform the analytics in real time in order to make IoT more intelligent, thus providing smarter services.

4. Pattern Analysis

This paper focuses on the data generated by sensors in the mobile health (m-health) field. Such sensor devices are connected to a server and are a source of big data. Patterns then need to be extracted from these data through different mining techniques [BS15, MPPFM10, DR13, RKDT14, BYX10, MS10]. Many international companies have developed applications for tracking physical activity throughout the world, among which are Google Fitbit, Apple HealthKit, Samsung SAMI, etc. The difficulty lies in the availability of such data, since most of it is confidential and therefore cannot be made publicly available. In this paper we have analyzed two realistic datasets: (1) the Heterogeneity Dataset for Human Activity Recognition [SBBPKDSJ15]; and (2) the PAMAP2 dataset (Physical Activity Monitoring) [SS12, RS12]. Both are publicly available to the research community. The first dataset was devised to benchmark human activity recognition algorithms (classification, automatic data segmentation, feature extraction, etc.) and contains readings from heterogeneous sensors, while the PAMAP2 dataset provides a good basis for developing and evaluating data processing and classification techniques for physical activity monitoring. The data mining algorithms applied to the two datasets are J48 (C4.5) and Naive Bayes. Each dataset is divided into two parts: a training set and a testing set. Applying these data mining algorithms shows how the sensor generated data are classified and what errors each algorithm produces.

4.1. Heterogeneity Dataset for Human Activity Recognition

The dataset contains the readings of two motion sensors commonly found in smartphones (gyroscope and accelerometer), recorded while nine users carrying smartwatches and smartphones executed scripted activities in no specific order. The activities performed by the users are: biking, sitting, standing, walking, stair up and stair down. The dataset contains 10 attributes together with the activity performed by users (which is what we are going to predict). The attributes taken into account for analysis are arrival time, creation time, and the X, Y and Z axes.

4.2. PAMAP2 dataset

The PAMAP2 dataset contains data from 18 everyday household and sport activities performed by 9 subjects wearing three IMUs (inertial measurement units) and a heart rate (HR) monitor. The dataset contains 54 attributes, and the one we are going to predict is the activity a user performs, based on the other parameters. These activities are: lying, sitting, standing, walking, running, cycling, Nordic walking, watching TV, computer work, car driving, ascending stairs, descending stairs, vacuum cleaning, ironing, folding laundry, house cleaning, playing soccer, rope jumping.
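
The following is a minimal sketch, not the authors' pipeline, of the evaluation idea described in Section 4: divide labeled sensor readings into a training part and a testing part, train a Naive Bayes classifier on the first, and count the correctly classified instances on the second. The readings are synthetic three-axis values invented for illustration, the Gaussian likelihood is an assumption (the paper does not say which Naive Bayes variant was used), and every function name here is hypothetical.

```python
import math
import random
from collections import defaultdict


def percentage_split(rows, train_fraction, seed=0):
    """Shuffle rows and split them into (train, test) by train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_gaussian_nb(train):
    """Estimate per-class priors and per-feature mean and variance."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    model = {}
    for label, items in by_class.items():
        n = len(items)
        means = [sum(col) / n for col in zip(*items)]
        # A small floor keeps every variance strictly positive.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*items), means)
        ]
        model[label] = (n / len(train), means, variances)
    return model


def predict(model, features):
    """Return the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def correctly_classified_percent(model, test):
    """Percentage of test rows whose predicted label matches the true one."""
    hits = sum(1 for features, label in test if predict(model, features) == label)
    return 100.0 * hits / len(test)


def synthetic_readings(n_per_class=200, seed=1):
    """Invented (x, y, z) readings for two activities; not real dataset values."""
    rng = random.Random(seed)
    centers = {"sitting": (0.0, 0.0, 9.8), "walking": (3.0, 2.0, 6.0)}
    rows = []
    for label, center in centers.items():
        for _ in range(n_per_class):
            rows.append(([rng.gauss(c, 0.5) for c in center], label))
    return rows


rows = synthetic_readings()
train, test = percentage_split(rows, 0.70)  # 70% of the rows go to training
model = fit_gaussian_nb(train)
cci = correctly_classified_percent(model, test)
print(f"train={len(train)} test={len(test)} CCI={cci:.2f}%")
```

Passing the training rows instead of the held-out rows to `correctly_classified_percent` gives the "using training data" style of evaluation, which tends to be the more optimistic of the two because the classifier has already seen those rows.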
4.3. Data Mining Process

The Data Mining process consisted of the following steps:
1-Detection of adequate data: This involved finding the sensor generated data that may be relevant to this study for the data mining process.
2-Data Pre-Processing: The collected data was transformed so that it could be processed by the data mining algorithms: some unnecessary columns and rows were removed from the data set according to data mining best practices. This resulted in a data set of 938086 rows with 6 columns for the Heterogeneity Dataset for Human Activity Recognition, and 249957 rows with 53 columns for the PAMAP2 Dataset.
3-Definition of Training Set: Classifiers were trained independently for the two datasets.
4-Algorithms Selection: The selected algorithms are J48 (C4.5) and Naive Bayes.
5-Training: Classifiers were produced by training the J48 and Naive Bayes data mining algorithms on the historical sensor data.
6-Evaluation: Classifiers were evaluated using the training set and a percentage split train/test set. The performance metrics produced for each classifier include Correctly Classified Instances (CCI) and Root Mean Squared Error (RMSE).

5. Results and discussions

Classifiers were independently trained using training data and a percentage split train/test set for the two datasets. The tables below give the performance metrics (CCI - Correctly Classified Instances, RMSE - Root Mean Squared Error) of the applied algorithms for each dataset.

Table 1: Evaluation of Heterogeneity Dataset for Human Activity Recognition

                              J48                   Naive Bayes
                              CCI (%)    RMSE       CCI (%)    RMSE
Using training data           84.22      0.1905     66.1       0.2689
Using percentage split        77.15      0.276      66.30      0.2701
  (% of training set)         (70%)                 (70%)

Table 2: Evaluation of PAMAP2 Dataset

                              J48                   Naive Bayes
                              CCI (%)    RMSE       CCI (%)    RMSE
Using training data           99.99      0.0033     96.84      0.0483
Using percentage split        99.93      0.0079     96.88      0.0479
  (% of training set)         (70%)                 (66%)

Experiments indicated that the J48 (C4.5) algorithm produces better results than the Naive Bayes algorithm for both datasets, and that the sensor generated data in the PAMAP2 Dataset serves better for physical activity recognition techniques.

6. Conclusions

Nowadays people use sensor devices in their daily activities, and the enormous volume of data produced and transmitted by these devices is considered a big data challenge. The implementation of Data Mining Techniques and Internet of Things in healthcare will help people maintain good health without requiring the presence of doctors and will motivate changes in physical activity behavior.
In this paper, we presented two case studies on applying data mining algorithms to historical sensor data (realistic data) in order to evaluate data processing and classification techniques for human physical activity monitoring.

References

[ThKS14] B. Thirunavukarasu, T. Kalaikumaran, S. Karthik. Integration of Data Mining and Internet of Things – Improved Athlete Performance And Health Care System. International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Special Issue 11, 28-31, Nov-Dec 2014.

[TLCY14] Ch.W. Tsai, Ch.F. Lai, M.Ch. Chiang, L.T. Yang. Data Mining for Internet of Things: A Survey. IEEE Communications Surveys & Tutorials, Vol. 16, No. 1, First Quarter 2014.

[BS15] Sh. Bhatia, S. Patel. Analysis on different Data mining Techniques and algorithms used in IOT. Int. Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 5, Issue 11, (Part - 1), 82-85, November 2015.

[MPPFM10] Alexandra Moraru, Marko Pesko, Maria Porcius, Carolina Fortuna, Dunja Mladenic. Using Machine Learning on Sensor Data. Journal of Computing and Information Technology - CIT, 18, 2010, 4, 341–347.

[DR13] Ch. Dule, K.M. Rajasekharaiah. Page Sensor Data Mining Model and System Design: A Review. International Refereed Journal of Engineering and Science (IRJES), ISSN (Online) 2319-183X, (Print) 2319-1821, Volume 2, Issue 6, 16-22, June 2013.

[RKDT14] A. Rook, A. Knauss, D. Damian, A. Thomo. A Case Study of Applying Data Mining to Sensor Data for Contextual Requirements Analysis. 978-1-4799-6355-3/14, IEEE AIRE 2014, Karlskrona, Sweden, 2014.

[BYX10] Sh. Bin, L. Yuan, W. Xiaoyi. Research on Data Mining Models for the Internet of Things. IEEE 978-1-4244-5555-3/10.

[MS10] A. Mannini, A.M. Sabatini. Machine Learning Methods for Classifying Human Physical Activity from On-Body Accelerometers. Sensors 2010, 10, 1154-1175; doi:10.3390/s100201154.

[SBBPKDSJ15] A. Stisen, H. Blunck, S. Bhattacharya, Th.S. Prentow, M.B. Kjærgaard, A. Dey, T. Sonne, M.M. Jensen. Smart Devices are Different: Assessing and Mitigating Mobile Sensing Heterogeneities for Activity Recognition. In Proc. 13th ACM Conference on Embedded Networked Sensor Systems (SenSys 2015), Seoul, Korea, 2015.

[SS12] R. Stricker, D. Stricker. Introducing a New Benchmarked Dataset for Activity Monitoring. The 16th IEEE International Symposium on Wearable Computers (ISWC), 2012.

[RS12] A. Reiss, D. Stricker. Creating and Benchmarking a New Dataset for Physical Activity Monitoring. The 5th Workshop on Affect and Behaviour Related Assistance (ABRA), 2012.