=Paper=
{{Paper
|id=Vol-2357/paper13
|storemode=property
|title=IoT-Hub: New IoT Data-Platform for Virtual Research Environments
|pdfUrl=https://ceur-ws.org/Vol-2357/paper13.pdf
|volume=Vol-2357
|authors=Rosa Filgueira,Rafael Ferreira Da Silva,Ewa Deelman,Vyron Christodoulou,Amrey Krause
|dblpUrl=https://dblp.org/rec/conf/iwsg/FilgueiraSDCK18
}}
==IoT-Hub: New IoT Data-Platform for Virtual Research Environments==
<pdf width="1500px">https://ceur-ws.org/Vol-2357/paper13.pdf</pdf>
<pre>
                       10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018


       IoT-Hub: New IoT Data-Platform for Virtual
                Research Environments
         Rosa Filgueira∗ , Rafael Ferreira da Silva‡ , Ewa Deelman‡ , Vyron Christodoulou§ , Amrey Krause∗
                ∗ University of Edinburgh, EPCC, Edinburgh, UK. Email: {r.filgueira, a.krause}@epcc.ed.ac.uk
           ‡ University of Southern California, ISI, Marina Del Rey, CA, USA. Email: {rafsilva, deelman}@isi.edu
                   § British Geological Survey, The Lyell Centre, Edinburgh, UK. Email: vyronc@bgs.ac.uk


   Abstract: This paper presents IoT-Hub, a new scalable, elastic, efficient, and portable Internet
of Things (IoT) data-platform based on microservices for monitoring and analysing large-scale
sensor data in real time. IoT-Hub allows us to collect, process, and store large amounts of data
from multiple sensors in distributed locations, and it could be deployed as a backend for Virtual
Research Environments (VREs) or Science Gateways. In the proposed data-platform, all required
software, which involves a variety of state-of-the-art open-source middleware, is packed into
containers and deployed in a cloud environment. As a result, the engineering and computational
time and costs for deployment and execution are significantly reduced.
   Keywords: IoT, Science Gateway, Virtual Research Environment, Data-Frameworks, Containers,
Data Science, Microservices

                                      I. INTRODUCTION

   The emergence of the Internet of Things (IoT) is introducing a new era to the realm of
computing and technology [1]. The proliferation of sensors and actuators that are embedded in
things enables these devices to understand their environments and respond accordingly more than
ever before. Additionally, it opens unlimited possibilities to domain scientists and/or data
scientists for building models and analyses that turn this sensation into big benefits to science
and society. Real-time processing of big data streams will gain importance as embedded technology
increases and we continue to generate new types and methods of data analysis [2], particularly in
regard to IoT. However, this revolutionary spread of IoT devices creates big challenges, such as
choosing, deploying, and managing adequate data-frameworks for data-intensive computation in
science, engineering, and many other fields.
   Virtual Research Environments (VREs), also known as Science Gateways [3], are web tools
accessible from anywhere. They usually provide an integrated view of all available resources with
pervasive data access control, handle continuity between sessions, and support collaboration with
shared data and methods. VREs open up opportunities for sharing and comparing both data from
experiments, observations, and model runs and analytic interpretations of these data. They are
very popular in a variety of scientific communities (e.g. seismology or astronomy) since they
hide many technical and management details whose use is not straightforward for non-experts. The
connection to and between VREs and science automation technologies has gained a lot of attention
in recent years. However, the same cannot be said of the connection between VREs, IoT, and the
new middleware for emerging data-intensive analytics.
   In this paper, we present IoT-Hub, an integrated, comprehensive, elastic, and portable
data-platform based on microservices. IoT-Hub combines the benefits of several well-known
data-frameworks with Docker containers. The current implementation of IoT-Hub includes a
service-pipeline composed of Apache Kafka, Apache Spark, Elasticsearch, and Kibana middleware
that enables automated gathering, preprocessing, storing, and visualization of IoT streams in a
scalable, efficient, and robust manner. IoT-Hub acts as a backend for VREs to run stream-based
applications, deploying cloud resources upon request. It reduces the engineering time and effort
(and possible human errors) required by scientists or VRE administrators to build such complex
systems. Our hypothesis is that if we provide scientific communities with portable and elastic
platforms to interrogate the IoT data, it will speed up scientific discoveries.
   We have demonstrated the feasibility of IoT-Hub via a real use case application, which
processes sensor data from the British Geological Survey (BGS) environmental baseline programme
[4] (freely available online). IoT-Hub collects, preprocesses, and stores time-series data in real
time from several distributed locations and sensor types, and makes them available to domain
scientists (e.g. groundwater modelers) and data scientists, so they can use them to build their
models, make predictions, and conduct analyses.
   This paper is structured as follows. Section II presents background. Section III discusses
IoT-Hub features. Section IV presents the use case for testing the platform. We conclude with a
summary of achievements and outline future work.

                              II. BACKGROUND AND RELATED WORK

   In this section, we provide a brief overview of the state of the art encompassing VREs, IoT,
and middleware for data-intensive analytics.

A. Virtual Research Environments (VREs)
   VREs can be defined as a community-developed set of tools, applications, and data that is
integrated via a portal or a suite of applications, usually in a graphical user interface, that
is further customized to meet the needs of a specific community [5]. These tools sit behind the
scenes and exploit a wealth of resources residing on multiple computing infrastructures


and data providers (according to their policies). Some VRE examples include:
  • VERCE [6] is a data-intensive e-science environment to enable innovative data analysis and
    data modeling methods that fully exploit the increasing wealth of open data generated by the
    observational and monitoring systems of the global seismology community.
  • MoSGrid [7] is a portal that offers an approach to carrying out high-quality molecular
    simulations on distributed compute infrastructures to scientists with all kinds of background
    and experience levels.
  • CyberSKA [8] is a collaborative portal which aims to address the current and future needs of
    data-intensive radio astronomy. A wide variety of tools and services have been developed and
    integrated with the CyberSKA portal, including a distributed data management system, a data
    access tool, remote visualization tools, and third party applications.
  • EFFORT [9] is an innovative platform to promote persistent collaborative research in Rock
    Physics and Volcanology. It organizes data from rock physics experiments and volcano
    monitoring to open up opportunities for sharing and comparing data, observations and model
    runs, and analytical interpretation methods.
  • myExperiment [10] is a portal for collaboration and sharing of workflows and experiments. In
    contrast to systems that simply make workflows available, it provides mechanisms to support
    the sharing of workflows within and across multiple communities via a social web approach.
   Taking a closer look at the technologies, tools, systems, and computing resources that often
sit behind VRE backends, we can categorize them as follows:
  • High Performance Computing solutions: aggregated computing resources to perform high
    performance computations (including processors, memory, disk, and operating system) [11];
  • Distributed Computing Infrastructures: distributed systems characterized by heterogeneous
    networked computers called to offer data processing facilities. This includes high-throughput
    computing and cloud computing;
  • Scientific workflow management systems (SWMS): systems enacting the definition and execution
    of scientific workflows consisting of a list of tasks and operations, the dependencies
    between the interconnected tasks, control-flow structures, and the data resources to be
    processed [12], [13];
  • Data analytics frameworks and platforms: platforms and workbenches enabling scientists to
    execute analytic tasks. Such platforms tend to provide their users with implementations of
    algorithms and (statistical) methods for the analytics tasks [14].
   These classes of solutions and approaches are not isolated; rather, they are expected to rely
on each other to provide VRE end users with easy-to-use, efficient, and effective data processing
facilities. For example, SWMS rely on distributed computing infrastructures to actually execute
their constituent tasks.
   The proposed framework (IoT-Hub) can be used as a VRE backend, where scientists can simply
inject and execute their processing analyses (via the VRE frontend) without putting effort into
operating the enabling technology. In order to meet this goal, we have leveraged Docker
containers, which allow us to have an elastic computational environment based on loosely coupled
services, which are immediately portable. Docker handles the packaging and execution of a
container so that it works identically across different machines, while exposing the necessary
interfaces for networking ports, volumes, and so forth, allowing other users to reconstruct an
equivalent computational environment. Therefore, IoT-Hub can be deployed on demand
(as-a-service), reducing engineering time and computational cost.

B. Internet of Things (IoT): Big Data challenges

   The explosive increase in the number of devices connected to the IoT and the exponential
increase in data consumption only reflect how the growth of big data perfectly overlaps with that
of IoT [15]. Therefore, many architectural design challenges have arisen for the delivery of big
data services based on the IoT. These challenges have been described in detail in [16]. In this
work, we have mainly focused on the following ones:
  • The number of IoT devices: With growth forecasted in the number of connected “things” and
    expected to reach billions world-wide, there will be masses of devices which may be a data
    source, and which may be subject to third party control;
  • Risk of IoT device malfunction: With a great number of IoT devices and manufacturers it is
    reasonable to assume there will be many occasions where IoT devices malfunction in various
    ways;
  • Update frequency: Though some devices will produce data reports at a low frequency there may
    be substantial quantities of data streaming from more sophisticated Internet connected
    things.
   Therefore, IoT-Hub has been designed to collect data from a wide range of geographically
distributed IoT devices that stream data at different frequencies and may malfunction
intermittently.
   Our proposed solution provides an IoT data-platform, which supports an ecosystem of third
party application developers (e.g. domain scientists and data scientists) to explore data given
the described challenges. IoT-Hub offers a degree of flexibility by making use of Docker and
Docker-compose tools for deploying services on demand in virtualized environments, such as cloud
systems. Previous works have targeted similar environments, such as [17], which is a cloud-based
autonomic information system for delivering Agriculture-as-a-Service (AaaS) through the use of
cloud and big data technologies.




C. Microservices & Middleware

   With recent advances in cloud computing, virtualization, containerization, continuous
integration, and the DevOps movement, deploying software solutions today is very different from
even just a few years ago. Today’s distributed applications are built as a set of independently
deployable microservices distributed over clusters of commodity hardware. The term microservice,
also known as the microservice architecture, refers to a new architectural style that structures
an application as a collection of loosely coupled services, which implement business
capabilities. So, we have built IoT-Hub following microservice principles [18].
   The question that now arises is which components should be used in IoT-Hub to develop a
high-performance platform that efficiently analyzes IoT big data. To answer this question, we
developed a prototype platform in an elastic cloud environment, and Falcon¹, Apache Kafka²,
Apache Spark³, Elasticsearch⁴, Kibana⁵, and Docker⁶ have been initially selected (see Table I).
This selection could be easily extended in the future to include additional data-frameworks, such
as Cassandra, Apache Flink, and Jupyter Notebooks.

TABLE I: Overview of the software that makes up IoT-Hub.

 Technology      Description                                                        Version
 Falcon          Reliable, high-performance Python web framework for building      2.0
                 large-scale app backends and microservices. It encourages the
                 REST architectural style with minimal external dependencies,
                 while remaining highly effective.
 Apache Kafka    Distributed streaming platform that allows for publishing and     0.10.2.0
                 subscribing to streams of records (topics) in a fault-tolerant
                 way and for processing streams of records as they occur.
 Apache Spark    Fast and general engine for large-scale data processing. Among    2.2.0
                 other features, it allows writing streaming jobs the same way
                 as writing batch jobs. It supports Java, Scala and Python.
 Elasticsearch   A distributed, RESTful search and analytics engine for            6.2.2 (oss)
                 performing and combining many types of searches: structured,
                 unstructured, geo, or metric.
 Kibana          An open source data visualization plugin for Elasticsearch. It    6.2.2 (oss)
                 provides visualization capabilities on top of the content
                 indexed on an Elasticsearch cluster. It also supports remote
                 I/O.
 Docker          A lightweight, stand-alone, executable package of a piece of      1.13.1
                 software that includes everything needed to run it: code,
                 runtime, system tools, system libraries, settings.

  ¹ https://falconframework.org/
  ² https://kafka.apache.org/
  ³ https://spark.apache.org/
  ⁴ https://www.elastic.co/
  ⁵ https://www.elastic.co/products/kibana
  ⁶ https://www.docker.com/

[Figure 1 (diagram). Column labels: Events / Data Streams; Data Processing / Data Cleaning; Data
Storage; Data Analysis / Visualizations. Components: sensors SA, SB, SC at Locations 1 to 3;
Falcon; Apache Kafka; Apache Spark; ElasticSearch; Kibana; Cloud Infrastructure. Stages: data
generation; data preparation and storage; applications. Roles: Data Producers (Domain
Scientists); Data Engineers; Scientists (Domain Scientists / Data Scientists).]
Fig. 1: IoT-Hub: Data-platform for gathering, quality checking, storing, and visualizing
environmental sensor streams.

                                   III. IOT-HUB FEATURES

   IoT-Hub integrates several middleware components based on the microservices architecture.
Apache Kafka provides the mechanism for ingesting real-time data streams and making them
available to downstream consumers in a parallel and fault-tolerant manner. Data in Apache Kafka
is organized into topics that are split into partitions for parallelism. A topic can be viewed as
an infinite stream where data is retained for a configurable amount of time. Producers are
applications that publish streams of records to one or more topics. In our case, Apache Kafka
streams events out to Apache Spark consumers which are subscribed to one or more topics for
parsing their content, all of which is done in near real-time.
   The Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of
live data streams. Data can be ingested from many sources (e.g. Apache Kafka, Flume, Twitter),
but in the current version of IoT-Hub we limited it to Apache Kafka. The Spark Streaming API can
be used for processing the ingested data using complex algorithms composed of high-level
functions like map, reduce, join, and window. The processed data can be published to yet another
Kafka topic for further consumption or it can be stored as results in HDFS, databases, or
dashboards. In this work, we have selected Elasticsearch as a temporary storage system. One of
the reasons for this choice is that elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset).
   Kibana offers interactive visualizations (e.g. histograms, line graphs, pie charts, sunbursts,
etc.) and advanced time




series analysis on Elasticsearch data, by leveraging the full aggregation capabilities of
Elasticsearch.
   In order to have a portable, scalable, and elastic data-platform, we created a Docker cluster
for each of the previous middleware and connected them via docker-compose, since it allows us to
set up and run multi-container environments. Figure 1 shows a visual description of the IoT-Hub
components, and the interactions from different roles (e.g. data producers, data architects, data
scientists, domain scientists) that we anticipate. All Dockerfiles and the docker-compose file
used to generate IoT-Hub are freely available online in a GitHub repository⁷, to allow
reproducibility and to share our approach with the scientific community.

[Figure 2 panel titles: "Lancashire - Groundwater sensors - boreholes from 1 to 5"; "Vale of
Pickering - Groundwater sensors - boreholes from 7 to 10"]

   For our experiments, we have used the NSF-Chameleon cloud⁸, using a CentOS 7 image with 42
CPUs for deploying our hub. Note that the proposed framework could be deployed to any other cloud
system.
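
   The Kafka concepts used in this section (a topic split into partitions, records retained in
the log, and consumers that read at their own pace) can be illustrated with a minimal in-memory
Python sketch. It is not the Kafka client API: the class ToyTopic, its methods, the hash-based
partitioning rule, and the consumer group names are assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        # One read offset per (consumer group, partition).
        self.offsets = defaultdict(int)

    def publish(self, key, value):
        """Append a record; records with the same key go to the same partition."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

    def poll(self, group):
        """Return every record the given consumer group has not read yet."""
        unread = []
        for index, log in enumerate(self.partitions):
            start = self.offsets[(group, index)]
            unread.extend(log[start:])
            self.offsets[(group, index)] = len(log)
        return unread


topic = ToyTopic("emb")
topic.publish("emb2", {"ph": 7.1})
first_batch = topic.poll("spark-consumer")

# The consumer is "offline" while two more readings arrive.
topic.publish("emb3", {"ph": 6.9})
topic.publish("emb2", {"ph": 7.0})
second_batch = topic.poll("spark-consumer")

# A different consumer group starts from the beginning of the retained log.
replay = topic.poll("audit-consumer")
print(len(first_batch), len(second_batch), len(replay))
```

   Because each consumer group keeps its own offsets, the producer never needs to know who is
reading, and a consumer that was stopped simply resumes from where it left off. A real
deployment would also bound retention by time or size, which this sketch leaves out.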

  ⁷ https://github.com/rosafilgueira/EMB datastreaming
  ⁸ https://www.chameleoncloud.org/

               IV. CASE STUDY: ENVIRONMENTAL BASELINE MONITORING PROGRAMME

   To demonstrate the feasibility of IoT-Hub, we have used the Environmental Baseline Monitoring
programme [4], which provides the perfect scenario for testing our platform with IoT sensor data.
The British Geological Survey (BGS), along with partners from the Universities of Manchester,
York, Birmingham, Bristol, and Public Health England (PHE), is carrying out a science-based
environmental monitoring programme in the areas of Lancashire [19] and Vale of Pickering [20].
This programme represents the first independent, integrated monitoring study to characterize the
environmental baseline in areas subjected to close scrutiny in anticipation of the development of
a nascent UK shale-gas industry. The monitoring involves ways of managing high volume, highly
varied data, generated by a range of IoT sensors, including:
  • Groundwater quality: The sensors installed in the boreholes provide real-time measurements
    of water-quality parameters: water level, temperature, pH, conductivity, and dissolved gases
    (O2, CH4, CO2, Rn).
  • Seismicity: The monitoring of background seismicity has involved installation of a network
    of seismic stations in the vicinity of the proposed shale gas wells. Real time seismic data
    are being collected from the array of stations to help characterize current levels of
    seismic activity. The information captured in near real-time includes: station code, station
    name, seismic data (for a single channel), and timestamp.
  • Air composition: The monitoring equipment measures concentrations of ozone (O3), particulate
    matter (PM1, PM2.5, PM4, and PM10), nitrogen oxides (NO, NO2 and NOx), methane (CH4),
    non-methane hydrocarbons (NMHCs), hydrogen sulphide (H2S) and carbon dioxide (CO2) as well
    as capturing meteorological information.

Fig. 2: Groundwater sensors from the areas of Lancashire and Vale of Pickering. In total, we have
9 boreholes (sensors are attached to boreholes) placed at geographically distributed locations.

   We have initially focused on groundwater quality sensor data. However, very little work has to
be done in IoT-Hub to enable support for other sensors. This is discussed in Section V. These
groundwater sensors are attached to boreholes, which are called emb1, emb2, ..., emb10. Figure 2
shows the locations of these boreholes. For simplicity, we have selected the emb2, emb3, and emb4
boreholes, but IoT-Hub supports any number of boreholes and sensors.
   IoT-Hub collects in simulated real-time the water-quality parameters described before, from
sensors attached to the selected boreholes (marked as 2, 3, and 4 in Figure 2). Since we did not
have direct access to these sensors, but access to yearly compressed files instead, monthly
datasets were downloaded locally [19]. These datasets are originally in comma separated values
(CSV) format and sensors provide a single reading for every hour of each day (every line
corresponds to an hourly reading). In this work, we simulate the setup where sensor data
corresponding to one hour arrives every 10 seconds to test the ability of IoT-Hub to deal with
high frequency data transmission. To achieve this, a feeder script was implemented to send POST
request messages (each with a sensor reading) to Falcon web services every 10 seconds. Then,
Falcon was configured to act as a producer, publishing streams to Apache Kafka by using the emb
topic. Apache Kafka ingests those streams in real-time and makes them available to a
Spark-Streaming application which acts as a consumer. This application is subscribed to the emb
topic, and stores the data in Elasticsearch (see Listing 1) after performing a quality check over
the values received (e.g. if the data is within the range specified by the sensor manufacturers).
If




the values are within the predefined range, all the data is                                                     2) Minimal or no parameter tuning.
stored into the corresponding fields and the normal label                                                       3) Both globalized and localized AD in evolving time
is stored in the qc field. Otherwise, the data is stored and                                                        series.
annotated with the anomaly label instead. Note that Apache                                                      The criteria address two main problems in the AD domain.
Kafka decouples and buffers the input streams, so producers                                                  First, the monitoring of the network for the detection of
and consumers need not know about each other. Producers                                                      anomalies without a-priori knowledge. Any selected method
from multiple sources can write data to any topic in Kafka,                                                  should be able to perform well both in global and local AD.
and several consumers can subscribe to the same topic, each                                                  Detecting localized anomalies can be utilized as a warning
performing a different analysis. Since Kafka persists data                                                   system to prevent device failures. As a consequence, this
to disk, consumers can be shut down to make changes and,                                                     delivers more consistent up-time, improved data quality, and
when restarted, they will retrieve all the data published                                                    thereby results in a more robust system. Second, the
while they were offline.                                                                                     monitoring of global trends that change over time and detect any
                                                                                                             unusual seasonal patterns. This can help in improving the
Listing 1: Elasticsearch index for storing groundwater values.                                               understanding about the nature of seasonal patterns or changes
index: emb test
type: emb                                                                                                    in research environments.
fields:                                                                                                         Due to the aforementioned reasons, the algorithm selected
    sensorid -> type: text, description: Id of the sensor
    date -> type: date, description: Date (UTC)                                                              is Twitter’s Seasonal Hybrid Extreme Studentized Deviate
    time -> type: date, description: Time (UTC)                                                              (S-H-ESD) AD algorithm [21] that uses robust statistics with a
    sec -> type: integer, description: Microsiemens per centimeter
    ph -> type: float, description: pH                                                                       focus on analyzing long term and short trends in time series.
    waterlevel -> type: float, description: Water level aOD                                                  The underlying algorithm employs time series decomposition
    watertemp -> type: float, description: Water temperature
    tdg -> type: integer, description: Total dissolved gas                                                   to detect both global and local anomalies. A combination of
    qc -> type: text, description: Quality control                                                           piecewise approximation to extract the trend of a time series
                                                                                                             and an ESD test for anomalies is performed to accommodate a
   Once sensor data becomes available in Elasticsearch, dif-
                                                                                                             more localized AD. What is of particular interest is the ability
ferent applications can be run to analyze it. These applications
                                                                                                             to detect both local and global anomalies or seasonal changes
can vary from simple scripts to check the insertion of data, to
                                                                                                             and identify when a different pattern emerges in a continuous
more complex machine learning analysis, such as the anomaly
                                                                                                             data flow.
detection that will be explained in the following subsection.
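The decomposition-plus-robust-test idea behind S-H-ESD can be illustrated with a short Python
sketch. This is not Twitter's implementation and not the code used in IoT-Hub: it removes a
per-phase median as a crude seasonal component, uses the median and the median absolute
deviation (MAD) as robust statistics, and replaces the ESD critical value (which needs
Student-t quantiles) with a fixed cutoff. The function name, the `period` and `cutoff`
arguments, and the stopping rule are assumptions made for illustration; `max_anoms` is read
here as an upper bound on the fraction of points that may be flagged.

```python
from statistics import median


def robust_anomalies(values, period, max_anoms=0.2, cutoff=3.0):
    """Return the sorted indices of suspected anomalies in `values`.

    Simplified stand-in for S-H-ESD (illustrative only):
      1. subtract the overall median and a per-phase median (seasonal part);
      2. repeatedly remove the most extreme residual while its robust
         z-score (based on median and MAD) exceeds `cutoff`;
      3. never flag more than `max_anoms * len(values)` points.
    """
    values = list(values)
    n = len(values)
    if n == 0:
        return []
    if not 1 <= period <= n:
        raise ValueError("period must be between 1 and len(values)")

    base = median(values)
    # Seasonal component: median of the centred values observed at each phase.
    seasonal = [median(v - base for v in values[phase::period])
                for phase in range(period)]
    residuals = {i: v - base - seasonal[i % period]
                 for i, v in enumerate(values)}

    flagged = []
    for _ in range(int(max_anoms * n)):
        remaining = list(residuals.values())
        centre = median(remaining)
        mad = median(abs(r - centre) for r in remaining)
        if mad == 0:
            break  # no spread left to measure deviations against
        worst = max(residuals, key=lambda i: abs(residuals[i] - centre))
        # 1.4826 scales the MAD to the standard deviation of a normal sample.
        score = abs(residuals[worst] - centre) / (1.4826 * mad)
        if score <= cutoff:
            break  # simplified stopping rule; the real ESD test differs here
        flagged.append(worst)
        del residuals[worst]
    return sorted(flagged)
```

The default `max_anoms=0.2` mirrors the value the paper reports for the maximum anomaly
parameter; every other default is arbitrary. A production version would use the
Student-t based critical values of the generalized ESD test instead of a fixed cutoff.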
   The only algorithm parameter that was pre-configured was the maximum anomaly, m (set to
0.2), which controls the algorithm’s upper bound of suspected anomalies. The algorithm was run
for each of the different measurements. For brevity, Figure 4 shows only the most interesting
AD case: the first run (left) and the second run (right) of the algorithm, 15 minutes later.
The first run shows normal deviation and no anomalies are detected. In the second run, the
algorithm picks up the anomalies, which show a small spike before the sensor stops and provides
readings of value zero.

   Furthermore, IoT-Hub includes Kibana for exploring the data visually (see Figure 3 as an
example).
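For reference, the index described in Listing 1 could be expressed as an Elasticsearch mapping
body along the following lines. This is a minimal sketch under stated assumptions: the field
names are taken from Listing 1 as they read once the letter-spacing of the extraction is
removed (the real names may differ, for example by containing underscores), the typeless
`mappings.properties` layout is the one used by recent Elasticsearch releases (older releases
nest `properties` under the document type, here `emb`), and `build_mapping` and
`check_reading` are hypothetical helpers, not part of IoT-Hub or of any Elasticsearch client.

```python
# Field names and types as read from Listing 1 (assumed spelling).
LISTING_1_FIELDS = {
    "sensorid": "text",
    "date": "date",
    "time": "date",
    "sec": "integer",
    "ph": "float",
    "waterlevel": "float",
    "watertemp": "float",
    "tdg": "integer",
    "qc": "text",
}

# Python types accepted for each Elasticsearch field type in this sketch.
_PYTHON_TYPES = {
    "text": (str,),
    "date": (str,),          # dates are assumed to arrive as strings
    "integer": (int,),
    "float": (int, float),
}


def build_mapping(fields=LISTING_1_FIELDS):
    """Return a mapping body of the form {"mappings": {"properties": {...}}}."""
    properties = {name: {"type": es_type} for name, es_type in fields.items()}
    return {"mappings": {"properties": properties}}


def check_reading(doc, fields=LISTING_1_FIELDS):
    """Return a list of problems found in one sensor reading (empty if none)."""
    problems = []
    for name, es_type in fields.items():
        if name not in doc:
            problems.append(f"missing field: {name}")
            continue
        value = doc[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PYTHON_TYPES[es_type]):
            problems.append(f"{name}: expected {es_type}")
    return problems
```

A check of this kind is one example of the simple scripts that can verify data before or
after insertion; the sample values used to exercise it would have to come from the real
sensor feed.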

Fig. 3: Kibana screenshot for filtering the normal values and counting the different values of
the water level parameter.

A. Anomaly Detection

   An anomaly detection (AD) algorithm has been implemented to periodically interrogate the
data that is automatically collected, processed, and stored by IoT-Hub. The AD algorithm is
called every 15 minutes to review all the data transmitted in that period. IoT-Hub presents new
challenges in an AD context due to its continuous data streaming.
   Therefore, we have established three criteria that an algorithm must fulfill in IoT-Hub:
   1) Robustness to seasonal changes (i.e., weather patterns).

[Figure 4: two panels, each titled "SEC AD", plotting SEC (uS/cm) against date: Feb 1 to Feb 5
(left) and Feb 6 to Feb 9 (right).]
Fig. 4: IoT-Hub: AD results in emb3, Lancashire sensor for February 2017. First run (left),
second run (right). Detected anomalies are shown as turquoise dots.

   The performance of the method is promising in a real world scenario for AD. However, as can
be seen from the high number of anomalies, there is always the caveat of too many false
positives, which is something to be avoided in real world
scenarios. The parameter selection also has to be evaluated periodically. Algorithms that
fulfill the above criteria, with a focus on precision rather than recall, have to be chosen
carefully in order to deliver robust results.
   In the future, we plan to create a warning system (e.g., via a VRE frontend) that makes use
of IoT-Hub for running the described AD analysis to interpret sensors in the field. This system
will send out alerts to domain scientists subscribed to these alerts, as well as to those in
charge of deploying and maintaining the sensors in the field.

                       V. CONCLUSIONS AND FUTURE WORK

   In this paper, we have presented IoT-Hub, which delivers a specialized data-framework to
exploit the IoT in a scalable, efficient, and robust manner, reducing the engineering time and
computational cost. We have demonstrated the feasibility of the proposed solution by using the
Environmental Baseline Monitoring programme. Data from different sensors are collected,
preprocessed, and stored in real-time using different microservices middleware. To test the
capacity of IoT-Hub for running complex data-analytics tasks, we have implemented an anomaly
detection algorithm, which queries data from Elasticsearch and detects the anomalies of each of
the water-quality parameters. All the middleware that forms IoT-Hub has been containerized,
which enables flexible and agile development and deployment in cloud-based infrastructures.
   The current version of IoT-Hub has been pre-configured for working with groundwater
sensors. Extending this work to other sensors would only require: (1) creating a new Kafka
topic to produce and consume the new datasets; (2) modifying the Spark-Streaming application to
consume and check the data from this new topic; and (3) creating a new Elasticsearch index.
   One of the main uses of IoT-Hub could be to act as a backend for VREs or Scientific Gateways
for running data-intensive applications and deploying cloud resources upon request.
   As future work, we plan to include more middleware in IoT-Hub, such as the Cassandra
database (for high performance operations and handling massive datasets), an RDF repository (to
store and harvest RDF data), a SPARQL endpoint (to query a knowledge base via the SPARQL
language), a job submission system (to submit applications to distributed computing
infrastructures), and Jupyter Notebook (to offer an interactive computational environment).
   Additionally, we also plan to create a warning system that makes use of IoT-Hub for running
the described AD analysis to interpret sensors in the field, and to leverage IoT-Hub
capabilities to process near real-time logs from scientific workflow executions [22].

Acknowledgments. This work was carried out when the lead author was with the British Geological
Survey. It was funded under the Scottish Informatics and Computer Science Alliance with the
Postdoctoral and Early Career Researcher Exchanges fellowship, partially funded by DOE under
Contract DESC0012636, “Panorama - Predictive Modeling and Diagnostic Monitoring of Extreme
Science Workflows”, and by “DARE - Delivering Agile Research Excellence on European
e-Infrastructures”, EU H2020 funded project (No 777413). We thank the NSF Chameleon Cloud for
providing time grants to access their resources. This work contains British Geological Survey
materials NERC [2017 and 2018 years].

                                   REFERENCES

 [1] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of things (iot): A vision,
     architectural elements, and future directions,” Future Gener. Comput. Syst., vol. 29,
     no. 7, pp. 1645–1660, 2013.
 [2] M. Atkinson and M. Parsons, “The digital-data challenge,” in The DATA Bonanza – Improving
     Knowledge Discovery for Science, Engineering and Business, M. P. Atkinson et al., Eds.
     Wiley, 2013, ch. 1, pp. 5–13.
 [3] L. Candela, D. Castelli, and P. Pagano, “Virtual research environments: An overview and a
     research agenda,” Data Science Journal, vol. 12, pp. GRDI75–GRDI81, 2013.
 [4] “Instrumenting the earth,” http://www.bgs.ac.uk/Sensors/.
 [5] C. E. Catlett, “Teragrid: A foundation for us cyberinfrastructure,” in Proceedings of the
     2005 IFIP International Conference on Network and Parallel Computing (NPC’05), 2005,
     pp. 1–1.
 [6] M. Atkinson, M. Carpene, E. Casarotti, S. Claus, R. Filgueira et al., “Verce delivers a
     productive e-science environment for seismology research,” in 2015 IEEE 11th International
     Conference on e-Science (e-Science)(E-SCIENCE), vol. 00, 2015, pp. 224–236.
 [7] L. de la Garza, J. Krüger, C. Schärfe, M. Röttig, S. Aiche, K. Reinert, and O. Kohlbacher,
     “From the desktop to the grid: conversion of knime workflows to guse,” in Proc. IWSG 2013,
     ser. CEUR-WS, vol. 993, 2013, p. 9. [Online]. Available: http://ceur-ws.org/Vol-993/
 [8] C. Kiddle et al., “Cyberska: An on-line collaborative portal for data-intensive radio
     astronomy,” in 2011 ACM Workshop on Gateway Computing Environments (GCE ’11), 2011,
     pp. 65–72.
 [9] R. Filgueira et al., “escience gateway stimulating collaboration in rock physics and
     volcanology,” in 2014 IEEE 10th International Conference on e-Science - Volume 01,
     ser. E-SCIENCE ’14, 2014, pp. 187–195.
[10] D. De Roure, C. Goble, and R. Stevens, “The design and realisation of the myExperiment
     Virtual Research Environment for social sharing of workflows,” Future Generation Computer
     Systems, vol. 25, no. 5, pp. 561–567, 2009.
[11] G. Hager and G. Wellein, Introduction to High Performance Computing for Scientists and
     Engineers, 1st ed. Boca Raton, FL, USA: CRC Press, Inc., 2010.
[12] I. J. Taylor, E. Deelman, D. B. Gannon, and M. Shields, Workflows for e-Science:
     scientific workflows for grids. Springer Publishing Company, Incorporated, 2014.
[13] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso, “A survey of data-intensive scientific
     workflow management,” J. Grid Comput., vol. 13, no. 4, pp. 457–493, Dec. 2015. [Online].
     Available: http://dx.doi.org/10.1007/s10723-015-9329-8
[14] C.-W. Tsai, C.-F. Lai, H.-C. Chao, and A. V. Vasilakos, “Big data analytics: a survey,”
     Journal of Big Data, vol. 2, no. 1, p. 21, 2015.
[15] E. Ahmed, I. Yaqoob, I. A. T. Hashem, I. Khan, A. I. A. Ahmed, M. Imran, and
     A. V. Vasilakos, “The role of big data analytics in internet of things,” Comput. Netw.,
     vol. 129, no. P2, pp. 459–471, 2017.
[16] “Iot big data framework architecture,”
     https://www.gsma.com/iot/wp-content/uploads/2016/11/CLP.25-v1.0.pdf.
[17] S. S. Gill, I. Chana, and R. Buyya, “Iot based agriculture as a cloud and big data
     service: The beginning of digital india,” JOEUC, vol. 29, no. 4, pp. 1–23, 2017.
[18] I. Nadareishvili, R. Mitra, M. McLarty, and M. Amundsen, Microservice Architecture:
     Aligning Principles, Practices, and Culture, 1st ed. O’Reilly Media, Inc., 2016.
[19] “Environmental baseline monitoring in the lancashire,”
     http://www.bgs.ac.uk/research/groundwater/shaleGas/monitoring/lancsDataSummary.html.
[20] “Environmental baseline monitoring in the vale of pickering,”
     http://www.bgs.ac.uk/research/groundwater/shaleGas/monitoring/vopDataSummary.html.
[21] O. Vallis, J. Hochenbaum, and A. Kejariwal, “A novel technique for long-term anomaly
     detection in the cloud,” in 6th USENIX Conference on Hot Topics in Cloud Computing
     (HotCloud’14), 2014, pp. 15–15.
[22] E. Deelman et al., “PANORAMA: An approach to performance modeling and diagnosis of extreme
     scale workflows,” International Journal of High Performance Computing Applications,
     vol. 31, no. 1, pp. 4–18, 2017.

</pre>