=Paper= {{Paper |id=Vol-1743/ks1 |storemode=property |title=Mining Internet of Things (IoT) Big Data Streams |pdfUrl=https://ceur-ws.org/Vol-1743/ks1.pdf |volume=Vol-1743 |authors=Albert Bifet |dblpUrl=https://dblp.org/rec/conf/simbig/Bifet16 }} ==Mining Internet of Things (IoT) Big Data Streams== https://ceur-ws.org/Vol-1743/ks1.pdf
                 Mining Internet of Things (IoT) Big Data Streams

                                       Albert Bifet
                              LTCI, CNRS, Télécom ParisTech
                                  Université Paris-Saclay
                              75634 Paris Cedex 13, FRANCE
                         albert.bifet@telecom-paristech.fr


Abstract

    Big Data and the Internet of Things (IoT) have the potential to
    fundamentally shift the way we interact with our surroundings. The
    challenge of deriving insights from the Internet of Things (IoT) has
    been recognized as one of the most exciting and key opportunities for
    both academia and industry. Advanced analysis of big data streams from
    sensors and devices is bound to become a key area of data mining
    research as the number of applications requiring such processing
    increases. Dealing with the evolution over time of such data streams,
    i.e., with concepts that drift or change completely, is one of the core
    issues in stream mining. To address this setting, MOA is a software
    framework with classification, regression, and frequent pattern
    methods, and the new Apache SAMOA is distributed streaming software
    for mining IoT data streams.

1   Introduction

The Internet of Things (IoT), the large network of physical devices that
extends beyond the typical computer networks, will be creating a huge
quantity of Big Data streams in real time in the near future. The
realization of IoT depends on being able to gain the insights hidden in the
vast and growing seas of data available. Since current approaches don't
scale to Internet of Things (IoT) volumes, new systems with novel mining
techniques are necessary due to the velocity, but also the variety and
variability, of such data.
   This IoT setting is challenging, and needs algorithms that use an
extremely small amount (iota) of time and memory resources, and that are
able to adapt to changes and never stop learning. These algorithms should be
distributed and run on top of Big Data infrastructures. How to do this
accurately in real time is the main challenge for IoT analytics systems in
the near future.
   In the IoT data stream model, data arrives at high speed, and algorithms
that process it must do so under very strict constraints of space and time.
Consequently, data streams pose several challenges for data mining algorithm
design. First, algorithms must work within limited resources (time and
memory). Second, they must deal with data whose nature or distribution
changes over time. We need to deal with resources in an efficient and
low-cost way. In data stream mining, we are interested in three main
dimensions:

  • accuracy

  • amount of space (computer memory) necessary

  • time required to learn from training examples and to predict

These dimensions are typically interdependent: adjusting the time and space
used by an algorithm can influence its accuracy. By storing more
pre-computed information, such as lookup tables, an algorithm can run faster
at the expense of space. An algorithm can also run faster by processing less
information, either by stopping early or storing less, thus having less data
to process. The more time an algorithm has, the more likely it is that
accuracy can be increased.
   IoT data streams are closely related to Big Data. Big Data is a new term
for datasets that, due to their large size, cannot be managed with typical
data mining software tools. Instead of defining "Big Data" as datasets of a
concrete large size, for example in the order of magnitude of petabytes, the
definition is related to the fact that the dataset is too big to be managed
without using new algorithms or technologies. There is a need for new
algorithms and new tools to deal
with all of this data. Doug Laney (Laney, 2001) was the first to mention the
3 V's of Big Data management: Volume, Variety and Velocity.

2   MOA

Massive Online Analysis (MOA) (Bifet et al., 2010) is a software environment
for implementing algorithms and running experiments for online learning from
evolving data streams. MOA includes a collection of offline and online
methods as well as tools for evaluation. In particular, it implements
boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes
classifiers at the leaves. It also implements regression and frequent
pattern methods. MOA supports bidirectional interaction with WEKA, the
Waikato Environment for Knowledge Analysis, and is released under the GNU
GPL license.

3   Apache SAMOA

Apache SAMOA (Scalable Advanced Massive Online Analysis) is a platform for
mining big data streams (Morales and Bifet, 2015). Like most of the rest of
the big data ecosystem, it is written in Java.
   Apache SAMOA is both a framework and a library. As a framework, it allows
the algorithm developer to abstract from the underlying execution engine,
and therefore reuse their code on different engines. It features a pluggable
architecture that allows it to run on several distributed stream processing
engines (DSPEs) such as Storm, S4, and Samza. This capability is achieved by
designing a minimal API that captures the essence of modern DSPEs. This API
also makes it easy to write new bindings to port Apache SAMOA to new
execution engines. Apache SAMOA takes care of hiding the differences of the
underlying DSPEs in terms of API and deployment.
   As a library, Apache SAMOA contains implementations of state-of-the-art
algorithms for distributed machine learning on streams. For classification,
Apache SAMOA provides a Vertical Hoeffding Tree (VHT), a distributed
streaming version of a decision tree. For clustering, it includes an
algorithm based on CluStream. For regression, it provides HAMR, a
distributed implementation of Adaptive Model Rules. The library also
includes meta-algorithms such as bagging and boosting.
   The platform is intended to be useful for both research and real world
deployments.

3.1   High Level Architecture

We identify three types of Apache SAMOA users:

  1. Platform users, who use available ML algorithms without implementing
     new ones.

  2. ML developers, who develop new ML algorithms on top of Apache SAMOA
     and want to be isolated from changes in the underlying DSPEs.

  3. Platform developers, who extend Apache SAMOA to integrate more DSPEs.

4   Conclusions

Mining Internet of Things (IoT) Big Data streams is a challenging task that
needs new tools to perform the most common machine learning tasks, such as
classification, clustering, and regression.
   Apache SAMOA is a platform for mining big data streams. It is already
available and can be found online at http://www.samoa-project.net. The
website includes a wiki, an API reference, and a developer's manual. Several
examples of how the software can be used are also available.

Acknowledgments

The presented work has been done in collaboration with Gianmarco De
Francisci Morales, Nicolas Kourtellis, Bernhard Pfahringer, Geoff Holmes,
Richard Kirkby, and all the contributors to MOA and Apache SAMOA.

References

Gianmarco De Francisci Morales and Albert Bifet. 2015. SAMOA: Scalable
  Advanced Massive Online Analysis. Journal of Machine Learning Research,
  16:149–153.

Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. 2010.
  MOA: Massive Online Analysis. Journal of Machine Learning Research,
  11:1601–1604, August.

Doug Laney. 2001. 3-D Data Management: Controlling Data Volume, Velocity
  and Variety. META Group Research Note, February 6.
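
Illustrative sketch. The following minimal Python example is not taken from MOA or Apache SAMOA; it only illustrates the interplay of accuracy, memory, and time described in the introduction when the data distribution changes over time. It assumes a synthetic one-feature stream whose labelling rule is inverted halfway through, a hypothetical toy classifier (BinnedMajority) that predicts the majority label per feature bin, and a test-then-train loop in which every example is first used for prediction and only then for learning. All names and parameter values are assumptions made for illustration.

```python
import random
import time
from collections import deque


def make_stream(n, drift_at, seed=7):
    """Yield (x, y) pairs; the labelling rule is inverted from index drift_at on."""
    rng = random.Random(seed)
    for i in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if i >= drift_at:
            y = 1 - y  # abrupt concept change
        yield x, y


class BinnedMajority:
    """Toy online classifier (hypothetical): majority label per feature bin.

    If window is given, only the most recent `window` examples are counted,
    which bounds memory and lets the model forget an outdated concept.
    """

    def __init__(self, bins=4, window=None):
        self.bins = bins
        self.counts = [[0, 0] for _ in range(bins)]
        self.window_size = window
        self.recent = deque() if window else None

    def _bin(self, x):
        return min(int(x * self.bins), self.bins - 1)

    def predict(self, x):
        zeros, ones = self.counts[self._bin(x)]
        return int(ones > zeros)  # ties default to label 0

    def learn(self, x, y):
        b = self._bin(x)
        self.counts[b][y] += 1
        if self.recent is not None:
            self.recent.append((b, y))
            if len(self.recent) > self.window_size:
                # Drop the oldest example so memory stays bounded.
                old_b, old_y = self.recent.popleft()
                self.counts[old_b][old_y] -= 1


def prequential_run(model, stream):
    """Predict on each example before learning from it; return (accuracy, seconds)."""
    correct = total = 0
    start = time.perf_counter()
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn(x, y)
        total += 1
    return correct / total, time.perf_counter() - start


if __name__ == "__main__":
    for name, model in [("no forgetting", BinnedMajority()),
                        ("window of 200", BinnedMajority(window=200))]:
        acc, secs = prequential_run(model, make_stream(4000, drift_at=2000))
        print(f"{name}: accuracy={acc:.3f}, time={secs:.4f}s")
```

With the sliding window, the classifier stores at most `window` examples, so its memory stays bounded and evidence from the old concept is discarded soon after the change. Without it, counts accumulated before the change keep outvoting the new concept, and accuracy on the second half of the stream suffers.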


