=Paper=
{{Paper
|id=Vol-2357/paper13
|storemode=property
|title=IoT-Hub: New IoT Data-Platform for Virtual Research Environments
|pdfUrl=https://ceur-ws.org/Vol-2357/paper13.pdf
|volume=Vol-2357
|authors=Rosa Filgueira,Rafael Ferreira Da Silva,Ewa Deelman,Vyron Christodoulou,Amrey Krause
|dblpUrl=https://dblp.org/rec/conf/iwsg/FilgueiraSDCK18
}}
==IoT-Hub: New IoT Data-Platform for Virtual Research Environments==
10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018
IoT-Hub: New IoT Data-Platform for Virtual Research Environments
Rosa Filgueira∗, Rafael Ferreira da Silva‡, Ewa Deelman‡, Vyron Christodoulou§, Amrey Krause∗
∗ University of Edinburgh, EPCC, Edinburgh, UK. Email: {r.filgueira, a.krause}@epcc.ed.ac.uk
‡ University of Southern California, ISI, Marina Del Rey, CA, USA. Email: {rafsilva, deelman}@isi.edu
§ British Geological Survey, The Lyell Centre, Edinburgh, UK. Email: vyronc@bgs.ac.uk
Abstract: This paper presents IoT-Hub, a new scalable, elastic, efficient, and portable Internet of Things (IoT) data-platform based on microservices for monitoring and analysing large-scale sensor data in real-time. IoT-Hub allows us to collect, process, and store large amounts of data from multiple sensors in distributed locations, and it could be deployed as a backend for Virtual Research Environments (VREs) or Science Gateways. In the proposed data-platform, all required software, which involves a variety of state-of-the-art open-source middleware, is packed into containers and deployed in a cloud environment. As a result, the engineering and computational time and costs for deployment and execution are significantly reduced.

Keywords: IoT, Science Gateway, Virtual Research Environment, Data-Frameworks, Containers, Data Science, Microservices

I. INTRODUCTION

The emergence of the Internet of Things (IoT) is introducing a new era to the realm of computing and technology [1]. The proliferation of sensors and actuators that are embedded in things enables these devices to understand their environments and respond accordingly more than ever before. Additionally, it opens unlimited possibilities to domain scientists and/or data scientists for building models and analyses that turn this sensing capability into big benefits to science and society. Real-time processing of big data streams will gain importance as embedded technology increases and we continue to generate new types and methods of data analysis [2], particularly in regard to IoT. However, this revolutionary spread of IoT devices creates big challenges, such as choosing, deploying, and managing adequate data-frameworks for data-intensive computation in science, engineering, and many other fields.

Virtual Research Environments (VREs), also known as Science Gateways [3], are web tools accessible from anywhere. They usually provide an integrated view of all available resources with pervasive data access control, handle continuity between sessions, and support collaboration with shared data and methods. VREs open up opportunities for sharing and comparing both data from experiments, observations, and model runs and analytic interpretations of these data. They are very popular in a variety of scientific communities (e.g. seismology or astronomy) since they hide many technical and management details whose use is not straightforward for non-experts. The connection to and between VREs and science automation technologies has gained a lot of attention in the last years. However, that is not the case for the VREs, IoT, and all new middleware for emerging data-intensive analytics.

In this paper, we present IoT-Hub, an integrated, comprehensive, elastic, and portable data-platform based on microservices. IoT-Hub combines the benefits of several well-known data-frameworks with Docker containers. The current implementation of IoT-Hub includes a service pipeline composed of Apache Kafka, Apache Spark, Elasticsearch, and Kibana middleware that enables automated gathering, preprocessing, storing, and visualization of IoT streams in a scalable, efficient, and robust manner. IoT-Hub acts as a backend for VREs to run stream-based applications, deploying cloud resources upon request. It reduces the engineering time and effort (and possible human errors) required by scientists or VRE administrators to build such complex systems. Our hypothesis is that if we provide scientific communities with portable and elastic platforms to interrogate the IoT data, it will speed up scientific discoveries.

We have demonstrated the feasibility of IoT-Hub via a real use case application, which processes sensor data from the British Geological Survey (BGS) environmental baseline programme [4] (freely available online). IoT-Hub collects, preprocesses, and stores in real-time time-series data from several distributed locations and sensor types, and makes them available to domain scientists (e.g. groundwater modelers) and data scientists, so they can use them to build their models, make predictions, and conduct analyses.

This paper is structured as follows. Section II presents background. Section III discusses IoT-Hub features. Section IV presents the use case for testing the platform. We conclude with a summary of achievements and outline future work.

II. BACKGROUND AND RELATED WORK

In this section, we provide a brief overview of the state of the art encompassing VREs, IoT, and middleware for data-intensive analytics.

A. Virtual Research Environments (VREs)

VREs can be defined as a community-developed set of tools, applications, and data that is integrated via a portal or a suite of applications, usually in a graphical user interface, that is further customized to meet the needs of a specific community [5]. These tools sit behind the scenes and exploit a wealth of resources residing on multiple computing infrastructures
and data providers (according to their policies). Some examples of VREs include:
• VERCE [6] is a data-intensive e-science environment to enable innovative data analysis and data modeling methods that fully exploit the increasing wealth of open data generated by the observational and monitoring systems of the global seismology community.
• MoSGrid [7] is a portal that offers an approach to carrying out high-quality molecular simulations on distributed compute infrastructures to scientists with all kinds of background and experience levels.
• CyberSKA [8] is a collaborative portal which aims to address the current and future needs of data-intensive radio astronomy. A wide variety of tools and services have been developed and integrated with the CyberSKA portal, including a distributed data management system, a data access tool, remote visualization tools, and third party applications.
• EFFORT [9] is an innovative platform to promote persistent collaboration research in Rock Physics and Volcanology. It organizes data from rock physics experiments and volcano monitoring to open up opportunities for sharing and comparing data, observations and model runs, and analytical interpretation methods.
• myExperiment [10] is a portal for collaboration and sharing of workflows and experiments. In contrast to systems that simply make workflows available, it provides mechanisms to support the sharing of workflows within and across multiple communities via a social web approach.

Having a closer look at the technologies, tools, systems, and computing resources that are very often behind VRE backends, we can categorize them as follows:
• High Performance Computing solutions: aggregated computing resources to perform high performance computations (including processors, memory, disk, and operating system) [11];
• Distributed Computing Infrastructures: distributed systems characterized by heterogeneous networked computers called to offer data processing facilities. This includes high-throughput computing and cloud computing;
• Scientific workflow management systems (SWMS): systems enacting the definition and execution of scientific workflows consisting of a list of tasks and operations, the dependencies between the interconnected tasks, control-flow structures, and the data resources to be processed [12], [13];
• Data analytics frameworks and platforms: platforms and workbenches enabling scientists to execute analytic tasks. Such platforms tend to provide their users with implementations of algorithms and (statistical) methods for the analytics tasks [14].

These classes of solutions and approaches are not isolated; rather, they are expected to rely on each other to provide VRE end users with easy-to-use, efficient, and effective data processing facilities, e.g., SWMS rely on distributed computing infrastructures to actually execute their constituent tasks.

The proposed framework (IoT-Hub) can be used as a VRE backend, where scientists can simply inject and execute their processing analyses (via a VRE frontend) without putting effort into operating the enabling technology. In order to meet this goal, we have leveraged Docker containers, which allow us to have an elastic computational environment based on loosely coupled services, which are immediately portable. Docker handles the packaging and execution of a container so that it works identically across different machines, while exposing the necessary interfaces for networking ports, volumes, and so forth, allowing other users to reconstruct an equivalent computational environment. Therefore, IoT-Hub can be deployed on demand (as-a-service), reducing engineering time and computational cost.

B. Internet of Things (IoT): Big Data challenges

The explosive increase in the number of devices connected to the IoT and the exponential increase in data consumption reflect how the growth of big data overlaps with that of IoT [15]. Therefore, many architectural design challenges have arisen for the delivery of big data services based on the IoT. These challenges have been described in detail in [16]. In this work, we have mainly focused on the following ones:
• The number of IoT devices: With growth forecasted in the number of connected “things” and expected to reach billions world-wide, there will be masses of devices which may be a data source, and which may be subject to third party control;
• Risk of IoT device malfunction: With a great number of IoT devices and manufacturers it is reasonable to assume there will be many occasions where IoT devices malfunction in various ways;
• Update frequency: Though some devices will produce data reports at a low frequency there may be substantial quantities of data streaming from more sophisticated Internet connected things.

Therefore, IoT-Hub has been designed to collect data from a wide range of different, geographically distributed IoT devices that stream data at different frequencies and could intermittently malfunction.

Our proposed solution provides an IoT data-platform, which supports an ecosystem of third party application developers (e.g. domain scientists and data scientists) to explore data given the described challenges. IoT-Hub offers a degree of flexibility by making use of the Docker and docker-compose tools for deploying services on demand in virtualized environments, such as cloud systems. Previous works have targeted similar environments, such as [17], which is a cloud-based autonomic information system for delivering Agriculture-as-a-Service (AaaS) through the use of cloud and big data technologies.
C. Microservices & Middleware

With recent advances in cloud computing, virtualization, containerization, continuous integration, and the DevOps movement, deploying software solutions today is very different from even just a few years ago. Today’s distributed applications are built as a set of independently deployable microservices distributed over clusters of commodity hardware. The term microservice, also known as the microservice architecture, refers to a new architectural style that structures an application as a collection of loosely coupled services, which implement business capabilities. We have therefore built IoT-Hub following microservice principles [18].

The question that arises now is which components should be used in IoT-Hub to develop a high performance platform to efficiently analyze IoT big data. To answer this question, we developed a prototype platform in an elastic cloud environment, and Falcon¹, Apache Kafka², Apache Spark³, Elasticsearch⁴, Kibana⁵, and Docker⁶ have been initially selected (see Table I). This selection could be easily extended in the future to include additional data-frameworks, such as Cassandra, Apache Flink, and Jupyter Notebooks.

TABLE I: Overview of the software that makes up IoT-Hub.
Technology | Description | Version
Falcon | Reliable, high-performance Python web framework for building large-scale app backends and microservices. It encourages the REST architectural style with minimal external dependencies, while remaining highly effective. | 2.0
Apache Kafka | Distributed streaming platform that allows for publishing and subscribing to streams of records (topics) in a fault-tolerant way and processing streams of records as they occur. | 0.10.2.0
Apache Spark | Fast and general engine for large-scale data processing. Among other features, it allows writing streaming jobs the same way as writing batch jobs. It supports Java, Scala and Python. | 2.2.0
Elasticsearch | A distributed, RESTful search and analytics engine for performing and combining many types of searches: structured, unstructured, geo or metric. | 6.2.2 (oss)
Kibana | An open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. It also supports remote I/O. | 6.2.2 (oss)
Docker | A lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. | 1.13.1

1 https://falconframework.org/
2 https://kafka.apache.org/
3 https://spark.apache.org/
4 https://www.elastic.co/
5 https://www.elastic.co/products/kibana
6 https://www.docker.com/

[Figure 1: sensors SA, SB, and SC at Locations 1, 2, and 3 produce events and data streams that pass through Falcon, Apache Kafka, Apache Spark, Elasticsearch, and Kibana on a cloud infrastructure, covering data processing and cleaning, data storage, and data analysis and visualization. Phases shown: data generation, data preparation and storage, applications. Roles shown: data producers (domain scientists), data engineers, and scientists (domain scientists and data scientists).]
Fig. 1: IoT-Hub: Data-platform for gathering, quality checking, storing, and visualizing environmental sensor streams.

III. IOT-HUB FEATURES

IoT-Hub integrates several middleware components based on the microservices architecture. Apache Kafka provides the mechanism for ingesting real-time data streams and making them available to downstream consumers in a parallel and fault-tolerant manner. Data in Apache Kafka is organized into topics that are split into partitions for parallelism. A topic can be viewed as an infinite stream where data is retained for a configurable amount of time. Producers are applications that publish streams of records to one or more topics. In our case, Apache Kafka streams events out to Apache Spark consumers which are subscribed to one or more topics for parsing their content, all of which is done in near real-time.

The Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources (e.g. Apache Kafka, Flume, Twitter), but in the current version of IoT-Hub we limited it to Apache Kafka. The Spark Streaming API can be used for processing the ingested data using complex algorithms composed of high-level functions like map, reduce, join, and window. The processed data can be published to yet another Kafka topic for further consumption or it can be stored as results in HDFS, databases, or dashboards. In this work, we have selected Elasticsearch as a temporary storage system. One of the reasons for this choice is that elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset).

Kibana offers interactive visualizations (e.g. histograms, line graphs, pie charts, sunbursts, etc.) and advanced time
series analysis on Elasticsearch data, by leveraging the full aggregation capabilities of Elasticsearch.

In order to have a portable, scalable, and elastic data-platform, we created a Docker cluster for each of the previous middleware components and connected them via docker-compose, since it allows us to set up and run multi-container environments. Figure 1 shows a visual description of the IoT-Hub components, and the interactions from different roles (e.g. data producers, data architects, data scientists, domain scientists) that we anticipate. All Dockerfiles and the docker-compose file used to generate IoT-Hub are freely available online in a GitHub repository⁷ to allow reproducibility and to share our approach with the scientific community.

For our experiments, we have used the NSF Chameleon cloud⁸, using a CentOS 7 image with 42 CPUs for deploying our hub. Note that the proposed framework could be deployed to any other cloud system.
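The topic semantics described in this section (partitioned, append-only streams read by independent consumers) can be illustrated with a small in-memory model. The sketch below is not the Kafka client API: the `Topic` class, its `publish` and `poll` methods, and the consumer group names are hypothetical and chosen only for illustration, and retention limits, replication, and networking are left out. Routing by a hash of the key is an assumption that loosely mirrors how keyed records are usually assigned to partitions.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy stand-in for a topic: append-only partitions plus per-group read offsets."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] is the index of the next record that group will read
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def publish(self, key, value):
        """Append a record; records with the same key always go to the same partition."""
        index = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

    def poll(self, group):
        """Return every record the group has not read yet and advance its offsets."""
        fresh = []
        for index, log in enumerate(self.partitions):
            start = self.offsets[group][index]
            fresh.extend(log[start:])
            self.offsets[group][index] = len(log)
        return fresh


if __name__ == "__main__":
    emb = Topic("emb")
    emb.publish("emb2", {"ph": 7.1})
    emb.publish("emb3", {"ph": 6.8})
    print(len(emb.poll("spark-streaming")))  # the consumer reads both readings
    emb.publish("emb4", {"ph": 7.4})          # arrives while the consumer is offline
    print(emb.poll("spark-streaming"))       # only the missed reading is returned
    print(len(emb.poll("dashboard")))        # a second group reads the full stream
```

Because each group keeps its own offsets, producers never need to know which consumers exist, and a consumer that was stopped simply resumes from where it left off.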
IV. CASE STUDY: ENVIRONMENTAL BASELINE MONITORING PROGRAMME

To demonstrate the feasibility of IoT-Hub, we have used the Environmental Baseline Monitoring programme [4], which provides the perfect scenario for testing our platform with IoT sensor data. The British Geological Survey (BGS), along with partners from the Universities of Manchester, York, Birmingham, Bristol, and Public Health England (PHE), is carrying out a science-based environmental monitoring programme in the areas of Lancashire [19] and Vale of Pickering [20]. This programme represents the first independent, integrated monitoring study to characterize the environmental baseline in areas subjected to close scrutiny in anticipation of the development of a nascent UK shale-gas industry. The monitoring involves ways of managing high-volume, highly varied data, generated by a range of IoT sensors, including:
• Groundwater quality: The sensors installed in the boreholes provide real-time measurements of water-quality parameters: water level, temperature, pH, conductivity, and dissolved gases (O2, CH4, CO2, Rn).
• Seismicity: The monitoring of background seismicity has involved installation of a network of seismic stations in the vicinity of the proposed shale gas wells. Real-time seismic data are being collected from the array of stations to help characterize current levels of seismic activity. The information captured in near real-time includes: station code, station name, seismic data (for a single channel), and timestamp.
• Air composition: The monitoring equipment measures concentrations of ozone (O3), particulate matter (PM1, PM2.5, PM4, and PM10), nitrogen oxides (NO, NO2 and NOx), methane (CH4), non-methane hydrocarbons (NMHCs), hydrogen sulphide (H2S) and carbon dioxide (CO2) as well as capturing meteorological information.

7 https://github.com/rosafilgueira/EMB_datastreaming
8 https://www.chameleoncloud.org/

[Figure 2: two map panels, "Lancashire - Groundwater sensors - boreholes from 1 to 5" and "Vale of Pickering - Groundwater sensors - boreholes from 7 to 10".]
Fig. 2: Groundwater sensors from the areas of Lancashire and Vale of Pickering. In total, we have 9 boreholes (sensors are attached to boreholes) placed at geographically distributed locations.

We have initially focused on groundwater quality sensor data. However, very little work has to be done in IoT-Hub to enable support for other sensors. This is discussed in Section V. These groundwater sensors are attached to boreholes, which are called emb1, emb2, ..., emb10. Figure 2 shows the locations of these boreholes. For simplicity, we have selected the emb2, emb3, and emb4 boreholes, but IoT-Hub supports any number of boreholes and sensors.

IoT-Hub collects in simulated real-time the water-quality parameters described before, from sensors attached to the selected boreholes (marked as 2, 3, and 4 in Figure 2). Since we did not have direct access to these sensors, but access to yearly compressed files instead, monthly datasets were downloaded locally [19]. These datasets are originally in comma separated values (CSV) format and sensors provide a single reading for every hour of each day (every line corresponds to an hourly reading). In this work, we simulate the setup where sensor data corresponding to one hour arrives every 10 seconds to test the ability of IoT-Hub to deal with high frequency data transmission. To achieve this, a feeder script was implemented to send POST requests (each with a sensor reading) to Falcon web services every 10 seconds. Then, Falcon was configured to act as a producer, publishing streams to Apache Kafka by using the emb topic. Apache Kafka ingests those streams in real-time and makes them available to a Spark Streaming application which acts as a consumer. This application is subscribed to the emb topic, and stores the data in Elasticsearch (see Listing 1) after performing a quality check over the values received (e.g. whether the data is within the range specified by the sensor manufacturers).
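The simulated feed and the range-based quality check described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' feeder script: the CSV column names, the limits in `VALID_RANGES`, and the `parse_rows`, `feed`, and `quality_label` helpers are hypothetical, and the `send` callback stands in for the HTTP POST to the Falcon web service. Only the `normal` and `anomaly` labels, the `qc` field, and the 10-second interval are taken from the text.

```python
import csv
import io
import json
import time

# Hypothetical plausibility limits; real limits would come from the sensor manufacturers.
VALID_RANGES = {"ph": (0.0, 14.0), "water_temp": (-5.0, 50.0)}


def quality_label(reading):
    """Return 'normal' if every checked parameter is inside its range, else 'anomaly'."""
    for name, (low, high) in VALID_RANGES.items():
        value = reading.get(name)
        if value is None or not (low <= value <= high):
            return "anomaly"
    return "normal"


def parse_rows(csv_text):
    """Turn hourly CSV lines into dictionaries with numeric parameters."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {
            "sensor_id": row["sensor_id"],
            "date": row["date"],
            "time": row["time"],
            "ph": float(row["ph"]),
            "water_temp": float(row["water_temp"]),
        }


def feed(csv_text, send, interval_s=10.0, sleep=time.sleep):
    """Send one JSON-encoded reading per interval and return how many were sent."""
    sent = 0
    for reading in parse_rows(csv_text):
        if sent:
            sleep(interval_s)  # one hour of sensor time is replayed every interval
        send(json.dumps(reading))
        sent += 1
    return sent


# Made-up sample rows in the assumed column layout.
SAMPLE = """sensor_id,date,time,ph,water_temp
emb3,2017-02-01,00:00,7.1,10.4
emb3,2017-02-01,01:00,0.0,10.5
emb3,2017-02-01,02:00,19.3,10.5
"""

if __name__ == "__main__":
    outbox = []
    feed(SAMPLE, outbox.append, sleep=lambda seconds: None)  # skip the waits in the demo
    for message in outbox:
        reading = json.loads(message)
        reading["qc"] = quality_label(reading)  # in IoT-Hub this check runs on the consumer side
        print(reading["time"], reading["qc"])
```

Passing `send` and `sleep` as parameters keeps the sketch runnable without a network connection; in a deployment, `send` would wrap whichever HTTP client posts the reading to the web service.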
If the values are within the predefined range, all the data is stored into the corresponding fields and the normal label is stored in the qc field. Otherwise, the data is stored and annotated with the anomaly label instead. Note that Apache Kafka offers decoupling and buffering of the input streams. Therefore, producers and consumers do not need to know about each other. Producers from multiple sources can write data to any topic in Kafka, and several consumers can be subscribed to the same topic, each of them performing a different processing analysis. Since Kafka persists data to disk, consumers can be shut down for performing changes and, when they are restarted, they will retrieve all data from the time they were offline.

Listing 1: Elasticsearch index for storing groundwater values.
index: emb_test
type: emb
fields:
  sensor_id -> type: text, description: Id of the sensor
  date -> type: date, description: Date (UTC)
  time -> type: date, description: Time (UTC)
  sec -> type: integer, description: Microsiemens per centimeter
  ph -> type: float, description: pH
  water_level -> type: float, description: Water level aOD
  water_temp -> type: float, description: Water temperature
  tdg -> type: integer, description: Total dissolved gas
  qc -> type: text, description: Quality control

Once sensor data becomes available in Elasticsearch, different applications can be run to analyze it. These applications can vary from simple scripts to check the insertion of data, to more complex machine learning analysis, such as the anomaly detection that will be explained in the following subsection. Furthermore, IoT-Hub includes Kibana for exploring the data visually (see Figure 3 as an example).

Fig. 3: Kibana screenshot for filtering the normal values, and counting the different values of the water level parameter.

A. Anomaly Detection

An anomaly detection (AD) algorithm has been implemented to periodically interrogate the data that is automatically collected, processed, and stored by IoT-Hub. The AD algorithm is called every 15 minutes to review all the data transmitted in that period. IoT-Hub presents new challenges in an AD context due to its continuous data streaming. Therefore, we have established three criteria that an algorithm must fulfill in IoT-Hub:
1) Robustness in seasonal changes (i.e., weather patterns).
2) Minimal or no parameter tuning.
3) Both globalized and localized AD in evolving time series.

The criteria address two main problems in the AD domain. The first is the monitoring of the network for the detection of anomalies without a-priori knowledge. Any selected method should be able to perform well both in global and local AD. Detecting localized anomalies can be utilized as a warning system to prevent device failures. As a consequence, this delivers more consistent up-time, improved data quality, and thereby results in a more robust system. The second is the monitoring of global trends that change over time and the detection of any unusual seasonal patterns. This can help in improving the understanding about the nature of seasonal patterns or changes in research environments.

Due to the aforementioned reasons, the algorithm selected is Twitter’s Seasonal Hybrid Extreme Studentized Deviate (S-H-ESD) AD algorithm [21], which uses robust statistics with a focus on analyzing long-term and short-term trends in time series. The underlying algorithm employs time series decomposition to detect both global and local anomalies. A combination of piecewise approximation to extract the trend of a time series and an ESD test for anomalies is performed to accommodate a more localized AD. What is of particular interest is the ability to detect both local and global anomalies or seasonal changes and identify when a different pattern emerges in a continuous data flow.

The only algorithm parameter that was pre-configured was the maximum anomaly parameter, m (set to 0.2), which controls the algorithm’s upper bound of suspected anomalies. The algorithm ran for each of the different measurements. For brevity, in Figure 4 only the most interesting AD case is shown. In this case, the first (left) and the second run (right) of the algorithm after 15 minutes are shown. The first figure shows normal deviation and no anomalies are detected. In the second run, the algorithm picks up the anomalies that show a small spike before the sensor stops and reports readings of value zero.

[Figure 4: two panels titled "SEC AD"; y-axis SEC (uS/cm), x-axis dates from Feb 1 to Feb 9.]
Fig. 4: IoT-Hub: AD results in emb3, Lancashire sensor for February 2017. First run (left), second run (right). Detected anomalies are shown as turquoise dots.

The performance of the method is promising in a real-world scenario for AD. However, as can be seen from the high number of anomalies, there is always the caveat of too many false positives, which is something to be avoided in real-world scenarios.
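To make the role of the upper bound on suspected anomalies concrete, the sketch below flags outliers in a short SEC-like series. It is deliberately not S-H-ESD: there is no seasonal decomposition and no ESD hypothesis test. As a simpler stand-in it uses the modified z-score built from the median and the median absolute deviation (MAD). The function name, the 3.5 threshold, and the sample values are assumptions made for illustration, and `max_anoms` plays the role of the parameter m.

```python
from statistics import median


def flag_anomalies(values, max_anoms=0.2, threshold=3.5):
    """Return sorted indices of suspected anomalies, at most max_anoms * len(values) of them."""
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half of the readings equal the median: treat any other value as extreme.
        scores = [0.0 if v == center else float("inf") for v in values]
    else:
        # Modified z-score; 0.6745 rescales the MAD to the standard deviation of a normal sample.
        scores = [0.6745 * abs(v - center) / mad for v in values]
    limit = int(max_anoms * len(values))
    suspects = [i for i, score in enumerate(scores) if score > threshold]
    suspects.sort(key=lambda i: scores[i], reverse=True)  # most extreme first
    return sorted(suspects[:limit])


# Hypothetical SEC readings: a stable signal, one spike, then a sensor reporting zeros.
SEC = [845.0, 845.1, 844.9, 845.2, 844.8, 845.0, 845.1, 844.9, 845.0, 845.2,
       844.8, 845.1, 844.9, 845.0, 845.1, 844.9, 845.0, 1100.0, 0.0, 0.0]

if __name__ == "__main__":
    print(flag_anomalies(SEC))                 # indices of the spike and the zero readings
    print(flag_anomalies(SEC, max_anoms=0.1))  # a tighter bound keeps only the two most extreme points
```

Lowering `max_anoms` trades recall for precision, since fewer points can be reported per run.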
The parameter selection also has to be evaluated periodically. Algorithms that fulfill the above criteria, with a focus on precision rather than recall, have to be chosen carefully in order to deliver robust results.

In the future, we plan to create a warning system (e.g., via a VRE frontend) that makes use of IoT-Hub for running the described AD analysis to interpret sensors in the field. This system will send out alerts to domain scientists subscribed to these alerts, as well as to those in charge of deploying and maintaining the sensors in the field.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented IoT-Hub, which delivers a specialized data-framework to exploit the IoT in a scalable, efficient, and robust manner, reducing the engineering time and computational cost. We have demonstrated the feasibility of the proposed solution by using the Environmental Baseline Monitoring programme. Data from different sensors are collected, preprocessed, and stored in real-time using different microservices middleware. To test the capacity of IoT-Hub for running complex data-analytics tasks, we have implemented an anomaly detection algorithm, which queries data from Elasticsearch and detects the anomalies of each of the water-quality parameters. All the middleware that forms IoT-Hub has been containerized, which enables flexible and agile development, and deployment in cloud-based infrastructures.

The current version of IoT-Hub has been pre-configured for working with groundwater sensors. To extend this work to other sensors, it would only be necessary to: (1) create a new Kafka topic to produce and consume new datasets; (2) modify the Spark Streaming application to consume and check the data from this new topic; and (3) create a new Elasticsearch index.

One of the main uses of IoT-Hub could be to act as a backend for VREs or Science Gateways for running data-intensive applications and deploying cloud resources upon request.

As future work, we plan to include more middleware in IoT-Hub, such as the Cassandra database (for high performance operations and handling massive datasets), an RDF repository (to store and harvest RDF data), a SPARQL endpoint (to query a knowledge base via the SPARQL language), a job submission system (to submit applications to distributed computing infrastructures), and Jupyter Notebook (to offer an interactive computational environment).

Additionally, we also plan to create a warning system that makes use of IoT-Hub for running the described AD analysis to interpret sensors in the field, and leverage IoT-Hub capabilities to process near real-time logs from scientific workflow executions [22].

Acknowledgments. This work was carried out when the lead author was with the British Geological Survey. It was funded under the Scottish Informatics and Computer Science Alliance with the Postdoctoral and Early Career Researcher Exchanges fellowship, partially funded by DOE under Contract DESC0012636, “Panorama - Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”, and by “DARE - Delivering Agile Research Excellence on European e-Infrastructures”, EU H2020 funded project (No 777413). We thank the NSF Chameleon Cloud for providing time grants to access their resources. This work contains British Geological Survey materials NERC [2017 and 2018 years].

REFERENCES

[1] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of things (IoT): A vision, architectural elements, and future directions,” Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1645–1660, 2013.
[2] M. Atkinson and M. Parsons, “The digital-data challenge,” in The DATA Bonanza: Improving Knowledge Discovery for Science, Engineering and Business, M. P. Atkinson et al., Eds. Wiley, 2013, ch. 1, pp. 5–13.
[3] L. Candela, D. Castelli, and P. Pagano, “Virtual research environments: An overview and a research agenda,” Data Science Journal, vol. 12, pp. GRDI75–GRDI81, 2013.
[4] “Instrumenting the earth,” http://www.bgs.ac.uk/Sensors/.
[5] C. E. Catlett, “TeraGrid: A foundation for US cyberinfrastructure,” in Proceedings of the 2005 IFIP International Conference on Network and Parallel Computing (NPC’05), 2005, pp. 1–1.
[6] M. Atkinson, M. Carpene, E. Casarotti, S. Claus, R. Filgueira et al., “VERCE delivers a productive e-science environment for seismology research,” in 2015 IEEE 11th International Conference on e-Science (e-Science), 2015, pp. 224–236.
[7] L. de la Garza, J. Krüger, C. Schärfl, M. Röttig, S. Aiche, K. Reinert, and O. Kohlbacher, “From the desktop to the grid: conversion of KNIME workflows to gUSE,” in Proc. IWSG 2013, ser. CEUR-WS, vol. 993, 2013, p. 9. [Online]. Available: http://ceur-ws.org/Vol-993/
[8] C. Kiddle et al., “CyberSKA: An on-line collaborative portal for data-intensive radio astronomy,” in 2011 ACM Workshop on Gateway Computing Environments (GCE ’11), 2011, pp. 65–72.
[9] R. Filgueira et al., “eScience gateway stimulating collaboration in rock physics and volcanology,” in 2014 IEEE 10th International Conference on e-Science - Volume 01, ser. E-SCIENCE ’14, 2014, pp. 187–195.
[10] D. De Roure, C. Goble, and R. Stevens, “The design and realisation of the myExperiment Virtual Research Environment for social sharing of workflows,” Future Generation Computer Systems, vol. 25, no. 5, pp. 561–567, 2009.
[11] G. Hager and G. Wellein, Introduction to High Performance Computing for Scientists and Engineers, 1st ed. Boca Raton, FL, USA: CRC Press, Inc., 2010.
[12] I. J. Taylor, E. Deelman, D. B. Gannon, and M. Shields, Workflows for e-Science: scientific workflows for grids. Springer Publishing Company, Incorporated, 2014.
[13] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso, “A survey of data-intensive scientific workflow management,” J. Grid Comput., vol. 13, no. 4, pp. 457–493, Dec. 2015. [Online]. Available: http://dx.doi.org/10.1007/s10723-015-9329-8
[14] C.-W. Tsai, C.-F. Lai, H.-C. Chao, and A. V. Vasilakos, “Big data analytics: a survey,” Journal of Big Data, vol. 2, no. 1, p. 21, 2015.
[15] E. Ahmed, I. Yaqoob, I. A. T. Hashem, I. Khan, A. I. A. Ahmed, M. Imran, and A. V. Vasilakos, “The role of big data analytics in internet of things,” Comput. Netw., vol. 129, no. P2, pp. 459–471, 2017.
[16] “IoT big data framework architecture,” https://www.gsma.com/iot/wp-content/uploads/2016/11/CLP.25-v1.0.pdf.
[17] S. S. Gill, I. Chana, and R. Buyya, “IoT based agriculture as a cloud and big data service: The beginning of Digital India,” JOEUC, vol. 29, no. 4, pp. 1–23, 2017.
[18] I. Nadareishvili, R. Mitra, M. McLarty, and M. Amundsen, Microservice Architecture: Aligning Principles, Practices, and Culture, 1st ed. O’Reilly Media, Inc., 2016.
[19] “Environmental baseline monitoring in the Lancashire,” http://www.bgs.ac.uk/research/groundwater/shaleGas/monitoring/lancsDataSummary.html.
[20] “Environmental baseline monitoring in the Vale of Pickering,” http://www.bgs.ac.uk/research/groundwater/shaleGas/monitoring/vopDataSummary.html.
[21] O. Vallis, J. Hochenbaum, and A. Kejariwal, “A novel technique for long-term anomaly detection in the cloud,” in 6th USENIX Conference on Hot Topics in Cloud Computing (HotCloud’14), 2014, pp. 15–15.
[22] E. Deelman et al., “PANORAMA: An approach to performance modeling and diagnosis of extreme scale workflows,” International Journal of High Performance Computing Applications, vol. 31, no. 1, pp. 4–18, 2017.