RaDEN: A Scalable and Efficient Radiation Data
                   Engineering
            Hadi Fadlallah                                  Yehia Taher                                  Ali Jaber
         Lebanese University                                                 -Quentin               Lebanese University
           Beirut, Lebanon                              -en-Yvelines (UVSQ)                           Beirut, Lebanon
      Hadi.Fadlullah@gmail.com                            Versailles, France                        ali.jaber@ul.edu.lb
                                                         yehia.taher@uvsq.fr


    Abstract Detecting and monitoring radiation level is one         high speed. Conventional data engineering technologies
of the critical duties for governments and researchers because       such as data warehouse are not adequate to handle this type
of the high threats it oppose to humans. It was challenging in       of data.
the past century to have a centralized radiation monitoring              Several data engineering technologies have been
system until the rise of IoT (Internet of Things). Radiation level
                                                                     proposed in literature such as [5],[6],[7],[8],[9],[10],[11],
is measured using wireless sensors that outputs data which are
transferred to a back-end server that monitors radiation and         [12],[13],[14],[15],[16] and many others. These solutions
alerts when high radiation levels are detected, the server also      aims engineering radiation pollution data. However, existing
stores the data for further analysis. The traditional data           solutions have several limitations that we summarized as
warehousing systems cannot handle this type of data any more         follows: (1) Existing technologies rely mainly on traditional
due to (1) data collection speed, (2) rapid data growth, and (3)     data technologies. (2) Most of them are focused on the data
data diversity. With the rise of Big Data notion, new                collection only. (3) Real-time data collection and processing
technologies are developed to handle data with similar               is outside of the scope of existing technologies. (4)
characteristics. In this paper, we proposed RaDEn a scalable         Scalability and fault-tolerance have not been dealt with by
and fault-tolerant radiation data engineering system that relies
                                                                     the technologies discussed in the previous sections. A
on Big Data technologies such as Hadoop, Kafka, Spark, and
Hive. The system is responsible of (1) reading data from             solution that can address these limitations is an
sensors and other sources, (2) monitor the radiation level in        indispensable need.
real-time, (3) storing the data, and (4) providing on-demand             In this paper, we have proposed a solution called RaDEn,
data retrieval to users. In addition, we have implemented our        which is a scalable and fault-tolerant system for radiation
system and conducted experiments in a real case scenario in          data engineering that relies mainly on new data technologies
collaboration with the department of environmental radiation         that are able to handle massive volume of data generated in
control at the Lebanese Atomic Energy Commission (LAEC-              high speed. RaDEn has the ability to read data from
CNRS).                                                               different sources, monitor radiation level in real-time,
  Keywords Radiation, data engineering, Big Data, radiation
                                                                     storing data in a scalable repository that provides on-
monitoring, real-time processing                                     demand data retrieval to users for further analysis.
                                                                         The remainder of this paper is organized as follows. In
                                                                     Section 2, we briefly introduce our solution called RaDEn.
                      I. INTRODUCTION
                                                                     The development of RaDEn will be detailed in Section 3.
                                                                     Section 4 demonstrates RaDEn. We conclude our work in
   Radiation pollution is a critical concern due to its
                                                                     Section 5.
detrimental impact on living beings and environment. There
are different types of radiation stemming from various
radioactive materials and natural resources [1]. The higher                         II. AN OVERVIEW OF RADEN
level of these radiations specifically the gamma radiation
                                                                         RaDEn is a scalable platform developed for radiation
causes severe damage to human health [2]. Therefore,
                                                                     data engineering. It allows fetching massive volume of data
controlling radiation level is critically important. In order to     from different sources. RaDEn enables user collection
do so, monitoring radiation sources is an indispensable task.        different types of data including such as structured databases,
The advent IoT (Internet of Things) specifically, sensors            data streams and flat files. RaDEn has a radiation data lake
have paved the foundation of building smart ecosystems that          which stores data a scalable cluster, process then with
enable collecting radiation data, processing, and analyzing          advanced techniques and visualize data using the best fit
radiation level in real-time [3]. Radiation sensors collect and      methods.
transmit data via communication network such as                          RaDEn adopted both realtime and batch style
telecommunication network, Wi-Fi, and Internet to the                philosophies for collecting and processing data. The hybrid
computational engine for measuring radiation levels.                 enables users to perform both realtime and batch style
Radiation monitoring sensors records data continuously; in           operations. The data streaming from sensors can be collected
consequence, massive volume data can be generated in a               by the users in realtime and files can be ingested in storage


                                                                                                                                       89
as batch style. The processing and can be done the same                  Data visualization: This layer is to visualize the results
ways. Besides these major operations it somewhat performs          of data processing and is responsible of bringing the end-user
pre-processing tasks such data transformation and loading          in action by drawing real-time graphs that show radiation
data into scalable data lake. Visualization is real-time           level timeline.
meaning that the streams can be visualized with minimal
latency and the can be done by files data.                              Coordination layer: This layer is responsible of making

    RaDEn is built on cluster computing and parallel               It is a service that runs in background and has the ability to
computing paradigm. In addition, it adopts the notion of Big       connect with any technology used.
Data. The cluster computing guides the solution adopt
technologies that foster scalability whereas the parallel                         III. DEVELOPMENT OF RADEN
computing provides computation models for designing
parallel operations using suitable programming model such             RaDEn is developed in two phases. In the first phase, we
as functional programming model.                                   developed the RaDEn system and in the second phase we
                                                                   developed an alarm system integrated within RaDEn.
    RaDEn is built-on multi-layered architecture. Figure 1
shows the architecture of RaDEn. The figure shows that the         A. RaDEn Core System
RaDEn system can collect data from any number of sources.
RaDEn consist of six layers which are explained in the                 We have built a 4-node Hadoop cluster. To build this
following:                                                         system we have first deployed four virtual machines where
                                                                   we have installed Ubuntu 1 16.04 LTS as operating system
       Data Sources: Data sources layer consists of data           and Hadoop 3.1.0 for data storage. We have configured the
streams sent from sensors installed in cities and mountains,       first virtual machine to act as the master node (name node)
relational databases where archive data are stored and flat        and the others to act as slaves (only stores data).
files that can be exported from any old echo system
                                                                      On the master node, we have also installed the data
                                                                   ingestion, processing and visualization tools. As a
                                                                   programming language, We have used python2 because it is
                                                                   more powerful than other languages in data science domain
                                                                   due to the presence of many specialized libraries.
                                                                       For data ingestion, we have installed Apache Kafka 3 ,
                                                                   Apache Flume4, and Apache Sqoop5. We used Apache Kafka
                                                                   as the main data ingestion tools, because: (1) it has the ability
                                                                   to read data from sensors directly. (2) It guarantees
                                                                   scalability and fault tolerance. (3) It can read data in real-
                                                                   time and at rest. (4) It can send the data to processing engine
                                                                   and to the data storage layer. (5) It is easy to implement
                                                                   using python programming language. To use Apache Kafka,
                                                                   first we have installed the Apache Zookeeper6 which acts as
                                                                   a coordinator that lets Apache Kafka communicate with
                                                                   other technologies. Then, we created two topics: (1)
                                                                                                                   to insert data to
                                                                   HDFS (Hadoop Distributed File System) without any

              Figure 1 - RaDEn architecture                        used for real-time processing and will insert data to HDFS
                                                                   Note that the first topic will be used for archive data only.
      Data ingestion layer: This layer is responsible of                We have configured Apache Flume agents to read from
reading data from different sources and delivering them to         Kafka topics consumers. The Flume agent is responsible of
data processing or data storage layer. This layer must ensure      storing data from Kafka to HDFS. In addition, we used
scalability and fault tolerance and must read a huge amount        Apache Sqoop to read data from relational databases and
of data from different sources in real-time and batch mode.        store it into HDFS. Reading archive data from relational
Data can be stored into Data storage layer directly or it can      databases is not one of the main goals of the system, but it is
be sent to Data processing layer to be processed in real-time.     an added value to allow user to migrate their old data from
      Data Storage: This layer is responsible of storing data.     traditional warehousing system. In addition, we have used
It relies heavily on HDFS, which is a distributed file system      Apache Hive 7 to define the metadata of the file stored in
that ensures high scalability and fault-tolerance. On the Top      HDFS to make the data retrieval process more easily using
of HDFS we have to use warehousing technology to define            SQL-Like languages such as HiveQL and Spark-SQL.
and configure the metadata of the data stored in HDFS and
let the users be able to perform easy data retrieval operations.   1
                                                                     https://ubuntu.com
      Data Processing Layer: This layer is responsible of          2
                                                                     https://python.org
processing data and notifying the end-user when a high level       3
                                                                     https://kafka.apache.org
radiation is detected. The nature of the data sources requires     4
                                                                     http://fume.apache.org
a distributed data processing platform that ensures a high         5
                                                                     http://sqoop.apache.org
scalability and fault tolerance.                                   6
                                                                     https://zookeeper.apache.org
                                                                   7
                                                                     https://hive.apache.org


                                                                                                                                       90
    For Data processing, our solution relies heavily on              (Figure 4), (4) Apache Flume agent (Figure 5), (5) Python
Apache Spark 8 for these main reasons: (1) it uses micro             data processing script (if the user want to insert data to
batching processing instead of stream processing, which              HDFS without visualizing data on a real-time graph,
more guarantee fault-tolerance, and in this solution fault           running this script is not needed).
tolerance is critical even if it may cause a few milliseconds
latency [4]. (2) It can process data at rest and in real-time. (3)
Spark has a wrapper library called PySpark that allows
creating and running Apache Spark jobs in python. (4) It is
scalable so we can add more nodes when it is required.
    We have created a single node Apache Spark cluster as
the first phase of deployment and we can add other slave
nodes when it is required. In addition, we have use pandas9
python library because it provides many classes and
functions that makes handling data easier.
   For Data visualization, we used matplotlib 10 library
which is a Python 2D plotting library which produces
publication quality figures in a variety of hardcopy formats
and interactive environments across platforms. It also allows                   Figure 2 - Starting Hadoop services
drawing real-time graphs.

B. RaDEn Alarm System
The data processing and visualization is done by a python
script, where we have implemented the radiation alert
system (designed based on the Lebanese Atomic Energy
Commission requirements) that work as the following: (1)
First, the user must define a threshold value. (2) The
radiation average is calculated based on the last 30 day from
the current date. (3) Then, when the current radiation value
is higher than the sum of the radiation average and the

                                                                                  Figure 3 -Starting Kafka services
                                                   ause the rain
increase the radiation level. (5) If an alert was raised (any
level) and the radiation level is still high after 5 hours then a

technician to visit the sensor location to do a checkup.
              IV. DEMONSTRATION OF RADEN
   In this section, we demonstrate RaDEn. For our
demonstration we used a radiation dataset supplied by the
department of environmental radiation control at the                             Figure 4 - Starting Spark services
Lebanese Atomic Energy Commission (LAEC-CNRS).
A. Dataset
    The dataset was provided by the LAEC-CNRS, is in
form of flat files, because accessing the sensors or the web
server (relational database) was not made due to
confidentiality issues. The dataset contains the data
collected from 2015-08-01 to 2016-08-01 from a testing
sensor that was installed in Beirut. It contains information
related to radiation such as: radiation level, temperature, rain
level, Sensor battery power, data collection time and
external battery power.
B. Starting RaDEn                                                                  Figure 5 - Starting Flume agent
    Before starting the process, the user must start the             C. Data Ingestion
following services: (1) Hadoop cluster (Figure 2), (2)                   To simulate data ingestion we have create a directory
Apache Kafka service (Figure 3), (3) Apache Spark cluster            where we must copy all flat files, and we created a terminal
                                                                     script that creates a listener on this folder. When any file is
8
  https://spark.apache.org                                           inserted, the script loops over the lines and send them one
9
  https://pandas.pydata.org                                          by one to the Apache Kafka producer. Once the data is sent
10
   https://matplotlib.org                                            to the Kafka producer, the Apache Flume agent sent it


                                                                                                                                       91
directly to HDFS to a specific directory and the data is
replicated on the three Hadoop data nodes. Figure 6 shows
the data ingested into HDFS via Hadoop web interface.
Also, figure 7 shows that the file stored in HDFS is
replicated on 3 data nodes.


                                                                            Figure 8 - Real-time graph screenshots

                                                                       In addition, when an alert is raised it is shown in a
                                                                   message box where the alarm level is written in the title and
                                                                   the description is written in the body. In figure 9, we have
                                                                   showed the level 1 alert message box.
            Figure 6 - Files ingested into HDFS


                                                                             Figure 9 - Alarm Level 1 Messagebox
                                                                   E. Data Retrieval
                                                                       Based on the Flat files structure and for data retrieval
                                                                   purposes, we created an External Table using Apache Hive
                                                                   on the Directory location in HDFS. Then we created a view
                                                                   from this table to remove messy data such as duplicates
                                                                   rows and rows where dose rate is null.
                                                                       After created external table on HDFS directory, we can
                                                                   search among the data imported to HDFS using Spark-SQL
                                                                   or HiveQL console. We have to write a query based on our
                                                                   requirements. As example, we need to retrieve all the data
                                                                   where the radiation level is higher than 50. First, we need to
                                                                   run the Spark SQL console using the spark-sql command or
                                                                   hive command to start Hive console, and we can use the
                                                                   following query:
             Figure 7 - Stored file availability
                                                                      SELECT * FROM vw_radiation WHERE dose_rate > 50;
D. Radiation Monitoring
    At the same time, the python script read the data from             In this example, it shows that the number of rows
Kafka consumer using the PySpark library, the alarm script         returned is 10459 rows in 13.99 seconds as illustrated in
is applied, and the radiation level is visualized on a real-time   Figure 10.
graph using matplotlib library. In figure 8, we have shown
sequential snapshots of the real-time graph that was
visualized during the experiments and it shows the radiation
level changes in function of date and time.


                                                                            Figure 10    Apache Hive query results.
                                                                                       V. CONCLUSION
                                                                        In this paper, we designed a solution called RaDEn that
                                                                   is able to handle a massive scale of data in real-time and


                                                                                                                                    92
batch style. It allow user to process the radiation data                           international journal of professional engineering studies, vol. 9, no. 3,
                                                                                   pp. 182-186, 2017.
coming from different sources, predict any possible
                                                                              [8] Pablo Andrade Grossi, Leonardo Soares de Souza, Geraldo Magela
radiation problems and visualize the data in real-time. In                         Figueiredo, Arthur Figueiredo, "Management Information System
addition, it gives the user the ability to query and retrieve                      Applied to Radiation Protection Services," in 2013 International
the data using a simple SQL-Like language. In addition, we                         Nuclear Atlantic Conference - INAC 2013, Brazil, 2013.
                                                                              [9] Kalpana.k, Shruti, Shweta, Bhagyasri, "An Integrated system For
explained how we implemented this solution and we showed
                                                                                   Regional Environmental Monitoring and Management Based on IoT,"
a real case scenario.                                                              Internatonal     Journal     of    Information     Technology        and
     We tried to cover all the challenges we identified at the                     ComputerEngineering, no. 16, pp. 60-64.
beginning of our work but unfortunately, the                                  [10] Eran Vax, Benny Sarusi, Mati Sheinfeld, Shmuel Levinson, Irad
                                                                                   Brandys, Danny Sattinger , Udi Wengrowicz, Avi Tshuva, Dan
implementation we have made have some limitations due to
                                                                                   Tirosh, "ERMS Environmental Radiation Monitoring System," Beer
the following issues: (1) we received a very small dataset                         Sheva, Israel.
that cannot be considered as Big Data while our system is                     [11]                                                       -Based Radiation
designed to handle Big Data. (2) We did not get permissions                        Monitoring and Warning System," Romania.
                                                                              [12] Xiaoyu Wanga, Zhaoguo Wang, Liyuan Xu, Deyun Chen, "Wireless
to access the sensors or the databases. (3) There is a lack of
                                                                                   Communications Radiation Monitoring System Based on ZigBee and
documentation for Big Data technologies. (4)           The                         GPRS," Advanced Materials Research, Vols. 403-408, no. 1662-
research time limit.                                                               8985, pp. 1956-1959, 2012.
     A list of works is lined up to be done in future. More                   [13] Camelia Avram, Silviu Folea, Dan Radu and Adina Astilean ,
                                                                                   "Wireless Radiation Monitoring System," in European Conference on
powerful tools such as bokeh11 and Kibana12 can be used to
                                                                                   Modelling and Simulation, Romania.
increase performance and more options to the end-user and                     [14] Cheng-Jian, Z., Xian-Hua, L., Xiang-Yong, S., & Qing-Zhou, L.,
are able to draw a huge number of real-time graphs at the                          "Analysis on the correlation of atmospheric path radiation and air
same time. In addition, adding some user interface will                            pollution index," in Urban Remote Sensing Event, 2009.
                                                                              [15] Baker, C., Davidson, G., Evans, T. M., Hamilton, S., Jarrell, J., &
make this this solution more powerful, because the current
                                                                                   Joubert, W., "High performance radiation transport simulations:
implementation is not user friendly; due to the lack of user                       preparing for Titan," in International Conference on High
interface, also it requires a good knowledge in SQL to be                          Performance Computing, Networking, Storage and Analysis, 2012.
able to retrieve data from HDFS. In addition, the solution                    [16] Jeong, M. H., Sullivan, C. J., & Wang, S., "Complex radiation sensor
                                                                                   network analysis with Big Data analytics," in In Nuclear Science
should be extended enhance the user the ability to visualize
                                                                                   Symposium and Medical Imaging Conference, 2015.
the results of queries on different types of graphs.
     Furthermore, we stored data as flat files in HDFS, to
improve the performance; we could create an automate job
that run periodically and move new data to another HDFS
location and convert it to Optimized Row Columnar (ORC)
files which gives faster results. Also in future extension,
distributed search engines such as Elasticsearch and Solr 13
can be used for data retrieval process.
                             REFERENCES
[1]   "Alpha, Beta, Gamma, X-Ray, and Neutron Radiation," Mirion
      technologies,                   [Online].                  Available:
      https://www.mirion.com/introduction-to-radiation-safety/types-of-
      ionizing-radiation. [Accessed 17 September 2018].
[2]   "Ionising Radiation and Human Health," Australian government -
      department of health, 07 December 2012. [Online]. Available:
      http://www.health.gov.au/internet/publications/publishing.nsf/Content
      /ohp-radiological-toc~ohp-radiological-05-ionising. [Accessed 17
      September 2018].
[3]   "Wireless Sensor Networks to Control Radiation Levels," Libelium,
      19          April         2011.           [Online].        Available:
      http://www.libelium.com/wireless_sensor_networks_to_control_radia
      tion_levels_geiger_counters. [Accessed 17 September 2018].
[4]   C. Prakash, "Spark Streaming vs Flink vs Storm vs Kafka Streams vs
      Samza : Choose Your Stream Processing Framework," 21 March
      2018. [Online]. Available: https://www.linkedin.com/pulse/spark-
      streaming-vs-flink-storm-kafka-streams-samza-choose-prakash/.
      [Accessed 10 9 2018].
[5]   Hsin-Fa Fang, Jeng-Jong Wang, Ing-Jang Chen and Jih-Hung Chiu,
      "The application of GPS, GIS and GPRS in Environmental Radiation
      Survey," Taiwan.
[6]   G. Segura Millan, D. Perrin, L. Scibile, "RAMSES: The LHC
      Radiation Monitoring System for the Environment and Safety," in
      10th ICALEPCS Int. Conf. on Accelerator & Large Expt. Physics
      Control Systems, Geneva, 2005.
[7]   L. R. NAIK, "An Integrated System for Regional Environmental
      Monitoring and Management Based on Internet of Things,"


11
   https://bokeh.pydata.org/
12
   https://www.elastic.co/products/kibana
13
   http://lucene.apache.org/solr/


                                                                                                                                                               93