RaDEN: A Scalable and Efficient Radiation Data Engineering

Hadi Fadlallah
Lebanese University
Beirut, Lebanon
Hadi.Fadlullah@gmail.com

Yehia Taher
Université de Versailles Saint-Quentin-en-Yvelines (UVSQ)
Versailles, France
yehia.taher@uvsq.fr

Ali Jaber
Lebanese University
Beirut, Lebanon
ali.jaber@ul.edu.lb

Abstract - Detecting and monitoring radiation levels is one of the critical duties of governments and researchers because of the high threat radiation poses to humans. In the past century, it was challenging to build a centralized radiation monitoring system until the rise of the IoT (Internet of Things). Radiation levels are measured using wireless sensors whose output is transferred to a back-end server that monitors radiation, raises alerts when high radiation levels are detected, and stores the data for further analysis. Traditional data warehousing systems can no longer handle this type of data due to (1) data collection speed, (2) rapid data growth, and (3) data diversity. With the rise of Big Data, new technologies have been developed to handle data with such characteristics. In this paper, we propose RaDEn, a scalable and fault-tolerant radiation data engineering system that relies on Big Data technologies such as Hadoop, Kafka, Spark, and Hive. The system is responsible for (1) reading data from sensors and other sources, (2) monitoring the radiation level in real time, (3) storing the data, and (4) providing on-demand data retrieval to users. In addition, we implemented our system and conducted experiments in a real case scenario in collaboration with the department of environmental radiation control at the Lebanese Atomic Energy Commission (LAEC-CNRS).

Keywords - Radiation, data engineering, Big Data, radiation monitoring, real-time processing

I. INTRODUCTION

Radiation pollution is a critical concern due to its detrimental impact on living beings and the environment. There are different types of radiation stemming from various radioactive materials and natural resources [1]. High levels of these radiations, specifically gamma radiation, cause severe damage to human health [2]. Therefore, controlling radiation levels is critically important, and to do so, monitoring radiation sources is an indispensable task. The advent of the IoT (Internet of Things), and of sensors specifically, has laid the foundation for building smart ecosystems that enable collecting radiation data and processing and analyzing radiation levels in real time [3]. Radiation sensors collect data and transmit it over communication networks such as telecommunication networks, Wi-Fi, and the Internet to the computational engine that measures radiation levels. Radiation monitoring sensors record data continuously; as a consequence, a massive volume of data can be generated at high speed. Conventional data engineering technologies such as data warehouses are not adequate to handle this type of data.

Several data engineering technologies have been proposed in the literature, such as [5]-[16] and many others. These solutions aim at engineering radiation pollution data. However, existing solutions have several limitations, which we summarize as follows: (1) Existing technologies rely mainly on traditional data technologies. (2) Most of them focus on data collection only. (3) Real-time data collection and processing is outside the scope of existing technologies. (4) Scalability and fault tolerance have not been dealt with by the technologies cited above. A solution that can address these limitations is indispensable.

In this paper, we propose a solution called RaDEn, a scalable and fault-tolerant system for radiation data engineering that relies mainly on new data technologies able to handle massive volumes of data generated at high speed. RaDEn can read data from different sources, monitor the radiation level in real time, and store data in a scalable repository that provides on-demand data retrieval to users for further analysis.

The remainder of this paper is organized as follows. In Section II, we briefly introduce our solution called RaDEn. The development of RaDEn is detailed in Section III. Section IV demonstrates RaDEn. We conclude our work in Section V.

II. AN OVERVIEW OF RADEN

RaDEn is a scalable platform developed for radiation data engineering. It allows fetching massive volumes of data from different sources. RaDEn enables users to collect different types of data, such as structured databases, data streams, and flat files. RaDEn has a radiation data lake which stores data in a scalable cluster, processes them with advanced techniques, and visualizes data using the best-fit methods.

RaDEn adopts both real-time and batch-style philosophies for collecting and processing data. This hybrid approach enables users to perform both real-time and batch-style operations. The data streaming from sensors can be collected by the users in real time, and files can be ingested into storage in batch style. The processing can be done in the same ways. Besides these major operations, RaDEn also performs some pre-processing tasks such as data transformation and loading data into the scalable data lake. Visualization is real-time, meaning that the streams can be visualized with minimal latency, and the same can be done for file data.

RaDEn is built on the cluster computing and parallel computing paradigms. In addition, it adopts the notion of Big Data. Cluster computing guides the solution to adopt technologies that foster scalability, whereas parallel computing provides computation models for designing parallel operations using a suitable programming model such as the functional programming model.

RaDEn is built on a multi-layered architecture, shown in Figure 1. The figure shows that the RaDEn system can collect data from any number of sources.

Figure 1 - RaDEn architecture

RaDEn consists of six layers, which are explained in the following:

Data Sources: The data sources layer consists of data streams sent from sensors installed in cities and mountains, relational databases where archive data are stored, and flat files that can be exported from any old ecosystem.

Data ingestion layer: This layer is responsible for reading data from different sources and delivering them to the data processing or data storage layer. This layer must ensure scalability and fault tolerance and must read a huge amount of data from different sources in real-time and batch mode. Data can be stored into the data storage layer directly, or it can be sent to the data processing layer to be processed in real time.

Data Storage: This layer is responsible for storing data. It relies heavily on HDFS, a distributed file system that ensures high scalability and fault tolerance. On top of HDFS, a warehousing technology has to be used to define and configure the metadata of the data stored in HDFS and to let users perform easy data retrieval operations.

Data Processing Layer: This layer is responsible for processing data and notifying the end user when a high radiation level is detected. The nature of the data sources requires a distributed data processing platform that ensures high scalability and fault tolerance.

Data visualization: This layer visualizes the results of data processing and is responsible for bringing the end user into action by drawing real-time graphs that show the radiation level timeline.

Coordination layer: This layer is responsible for making [...]. It is a service that runs in the background and can connect with any technology used.

III. DEVELOPMENT OF RADEN

RaDEn was developed in two phases. In the first phase, we developed the RaDEn system, and in the second phase we developed an alarm system integrated within RaDEn.

A. RaDEn Core System

We built a 4-node Hadoop cluster. To build this system, we first deployed four virtual machines on which we installed Ubuntu¹ 16.04 LTS as the operating system and Hadoop 3.1.0 for data storage. We configured the first virtual machine to act as the master node (name node) and the others to act as slaves (which only store data).

On the master node, we also installed the data ingestion, processing, and visualization tools. As a programming language, we used Python² because its many specialized libraries make it more powerful than other languages in the data science domain.

For data ingestion, we installed Apache Kafka³, Apache Flume⁴, and Apache Sqoop⁵. We used Apache Kafka as the main data ingestion tool because: (1) It can read data from sensors directly. (2) It guarantees scalability and fault tolerance. (3) It can read data in real time and at rest. (4) It can send the data to the processing engine and to the data storage layer. (5) It is easy to use from the Python programming language. To use Apache Kafka, we first installed Apache Zookeeper⁶, which acts as a coordinator that lets Apache Kafka communicate with other technologies. Then, we created two topics: (1) a topic that inserts data into HDFS (Hadoop Distributed File System) without any [...]; (2) [...] used for real-time processing, which also inserts data into HDFS. Note that the first topic is used for archive data only.

We configured Apache Flume agents to read from the Kafka topics as consumers. The Flume agent is responsible for storing data from Kafka into HDFS. In addition, we used Apache Sqoop to read data from relational databases and store it into HDFS. Reading archive data from relational databases is not one of the main goals of the system, but it is an added value that allows users to migrate their old data from a traditional warehousing system. We also used Apache Hive⁷ to define the metadata of the files stored in HDFS, which makes data retrieval easier through SQL-like languages such as HiveQL and Spark-SQL.

¹ https://ubuntu.com
² https://python.org
³ https://kafka.apache.org
⁴ http://flume.apache.org
⁵ http://sqoop.apache.org
⁶ https://zookeeper.apache.org
⁷ https://hive.apache.org

For data processing, our solution relies heavily on Apache Spark⁸ for these main reasons: (1) It uses micro-batch processing instead of stream processing, which guarantees more fault tolerance; in this solution fault tolerance is critical even if it may cause a few milliseconds of latency [4]. (2) It can process data at rest and in real time. (3) Spark has a wrapper library called PySpark that allows creating and running Apache Spark jobs in Python. (4) It is scalable, so we can add more nodes when required. We created a single-node Apache Spark cluster as the first phase of deployment, and we can add other slave nodes when required. In addition, we used the pandas⁹ Python library because it provides many classes and functions that make handling data easier.

For data visualization, we used the matplotlib¹⁰ library, a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. It also allows drawing real-time graphs.

B. RaDEn Alarm System

The data processing and visualization is done by a Python script in which we implemented the radiation alert system (designed based on the Lebanese Atomic Energy Commission requirements). It works as follows: (1) First, the user must define a threshold value. (2) The radiation average is calculated over the last 30 days from the current date. (3) Then, when the current radiation value is higher than the sum of the radiation average and the [...] because the rain increases the radiation level. (5) If an alert was raised (at any level) and the radiation level is still high after 5 hours, then a technician should visit the sensor location to do a checkup.

IV. DEMONSTRATION OF RADEN

In this section, we demonstrate RaDEn. For our demonstration, we used a radiation dataset supplied by the department of environmental radiation control at the Lebanese Atomic Energy Commission (LAEC-CNRS).

A. Dataset

The dataset was provided by the LAEC-CNRS in the form of flat files, because access to the sensors or the web server (relational database) was not granted due to confidentiality issues. The dataset contains the data collected from 2015-08-01 to 2016-08-01 by a testing sensor that was installed in Beirut. It contains information related to radiation such as radiation level, temperature, rain level, sensor battery power, data collection time, and external battery power.

B. Starting RaDEn

Before starting the process, the user must start the following services: (1) the Hadoop cluster (Figure 2), (2) the Apache Kafka service (Figure 3), (3) the Apache Spark cluster (Figure 4), (4) the Apache Flume agent (Figure 5), and (5) the Python data processing script (if the user wants to insert data into HDFS without visualizing it on a real-time graph, running this script is not needed).

Figure 2 - Starting Hadoop services
Figure 3 - Starting Kafka services
Figure 4 - Starting Spark services
Figure 5 - Starting Flume agent

C. Data Ingestion

To simulate data ingestion, we created a directory into which all flat files must be copied, and we wrote a terminal script that creates a listener on this folder.
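As an illustration only, a directory listener of this kind can be sketched in Python as follows. This is not the terminal script used in the experiments: the polling approach, the function names (scan_new_files, forward_file, poll_once), and the send callback standing in for a Kafka producer are all assumptions made for the example.

```python
import os
import tempfile


def scan_new_files(directory, seen):
    """Return paths of regular files in `directory` that are not in `seen`."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            new_paths.append(path)
    return new_paths


def forward_file(path, send):
    """Pass every non-empty line of `path` to `send`; return the number sent."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            record = line.rstrip("\n")
            if record:
                send(record)  # hypothetical stand-in for a producer call
                count += 1
    return count


def poll_once(directory, seen, send):
    """One polling pass: forward all new files and remember them in `seen`."""
    total = 0
    for path in scan_new_files(directory, seen):
        total += forward_file(path, send)
        seen.add(path)
    return total


if __name__ == "__main__":
    # Demo: a temporary folder plays the watched directory and a plain list
    # plays the producer, so the sketch runs without any cluster.
    watch_dir = tempfile.mkdtemp()
    with open(os.path.join(watch_dir, "sample.txt"), "w") as handle:
        handle.write("first record\nsecond record\n")
    outbox = []
    sent = poll_once(watch_dir, set(), outbox.append)
    print(sent, outbox)
```

In a deployment, send would wrap the producer call of whichever Kafka client is used, and poll_once would be called repeatedly with a short pause between passes so that files copied into the folder later are picked up.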
When any file is inserted, the script loops over its lines and sends them one by one to the Apache Kafka producer. Once the data is sent to the Kafka producer, the Apache Flume agent sends it directly to a specific directory in HDFS, and the data is replicated on the three Hadoop data nodes. Figure 6 shows the data ingested into HDFS as seen in the Hadoop web interface, and Figure 7 shows that the file stored in HDFS is replicated on three data nodes.

⁸ https://spark.apache.org
⁹ https://pandas.pydata.org
¹⁰ https://matplotlib.org

Figure 6 - Files ingested into HDFS
Figure 7 - Stored file availability

D. Radiation Monitoring

At the same time, the Python script reads the data from the Kafka consumer using the PySpark library, applies the alarm script, and visualizes the radiation level on a real-time graph using the matplotlib library. Figure 8 shows sequential snapshots of the real-time graph produced during the experiments; it shows how the radiation level changes as a function of date and time.

Figure 8 - Real-time graph screenshots

In addition, when an alert is raised, it is shown in a message box where the alarm level is written in the title and the description is written in the body. Figure 9 shows the level 1 alert message box.

Figure 9 - Alarm Level 1 Messagebox

E. Data Retrieval

Based on the structure of the flat files, and for data retrieval purposes, we created an external table using Apache Hive on the directory location in HDFS. Then we created a view on this table to remove messy data such as duplicate rows and rows where the dose rate is null. Once the external table is created on the HDFS directory, we can search the data imported into HDFS using the Spark-SQL or HiveQL console, writing a query based on our requirements. As an example, suppose we need to retrieve all the data where the radiation level is higher than 50. First, we start the Spark SQL console with the spark-sql command, or the Hive console with the hive command, and then we can use the following query:

    SELECT * FROM vw_radiation WHERE dose_rate > 50;

In this example, the query returned 10459 rows in 13.99 seconds, as illustrated in Figure 10.

Figure 10 - Apache Hive query results

V. CONCLUSION

In this paper, we designed a solution called RaDEn that is able to handle data at massive scale in both real-time and batch style. It allows users to process radiation data coming from different sources, predict possible radiation problems, and visualize the data in real time. In addition, it gives users the ability to query and retrieve the data using a simple SQL-like language. We also explained how we implemented this solution and showed a real case scenario.

We tried to cover all the challenges we identified at the beginning of our work, but the implementation has some limitations due to the following issues: (1) We received a very small dataset that cannot be considered Big Data, while our system is designed to handle Big Data. (2) We did not get permission to access the sensors or the databases. (3) Documentation for Big Data technologies is lacking. (4) The research time was limited.

Several extensions are planned as future work. More powerful tools such as bokeh¹¹ and Kibana¹² can be used to increase performance, offer more options to the end user, and draw a large number of real-time graphs at the same time. In addition, adding a user interface would make this solution more powerful: the current implementation is not user friendly because it lacks a user interface and requires a good knowledge of SQL to retrieve data from HDFS. The solution should also be extended to give the user the ability to visualize query results on different types of graphs. Furthermore, data are currently stored as flat files in HDFS; to improve performance, we could create an automated job that runs periodically, moves new data to another HDFS location, and converts it to Optimized Row Columnar (ORC) files, which give faster results. Finally, distributed search engines such as Elasticsearch and Solr¹³ can be used for the data retrieval process.

REFERENCES

[1] "Alpha, Beta, Gamma, X-Ray, and Neutron Radiation," Mirion Technologies. [Online]. Available: https://www.mirion.com/introduction-to-radiation-safety/types-of-ionizing-radiation. [Accessed 17 September 2018].
[2] "Ionising Radiation and Human Health," Australian Government - Department of Health, 07 December 2012. [Online]. Available: http://www.health.gov.au/internet/publications/publishing.nsf/Content/ohp-radiological-toc~ohp-radiological-05-ionising. [Accessed 17 September 2018].
[3] "Wireless Sensor Networks to Control Radiation Levels," Libelium, 19 April 2011. [Online]. Available: http://www.libelium.com/wireless_sensor_networks_to_control_radiation_levels_geiger_counters. [Accessed 17 September 2018].
[4] C. Prakash, "Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework," 21 March 2018. [Online]. Available: https://www.linkedin.com/pulse/spark-streaming-vs-flink-storm-kafka-streams-samza-choose-prakash/. [Accessed 10 9 2018].
[5] Hsin-Fa Fang, Jeng-Jong Wang, Ing-Jang Chen and Jih-Hung Chiu, "The application of GPS, GIS and GPRS in Environmental Radiation Survey," Taiwan.
[6] G. Segura Millan, D. Perrin, L. Scibile, "RAMSES: The LHC Radiation Monitoring System for the Environment and Safety," in 10th ICALEPCS Int. Conf. on Accelerator & Large Expt. Physics Control Systems, Geneva, 2005.
[7] L. R. Naik, "An Integrated System for Regional Environmental Monitoring and Management Based on Internet of Things," International Journal of Professional Engineering Studies, vol. 9, no. 3, pp. 182-186, 2017.
[8] Pablo Andrade Grossi, Leonardo Soares de Souza, Geraldo Magela Figueiredo, Arthur Figueiredo, "Management Information System Applied to Radiation Protection Services," in 2013 International Nuclear Atlantic Conference - INAC 2013, Brazil, 2013.
[9] Kalpana.k, Shruti, Shweta, Bhagyasri, "An Integrated system For Regional Environmental Monitoring and Management Based on IoT," International Journal of Information Technology and Computer Engineering, no. 16, pp. 60-64.
[10] Eran Vax, Benny Sarusi, Mati Sheinfeld, Shmuel Levinson, Irad Brandys, Danny Sattinger, Udi Wengrowicz, Avi Tshuva, Dan Tirosh, "ERMS Environmental Radiation Monitoring System," Beer Sheva, Israel.
[11] [...] "[...]-Based Radiation Monitoring and Warning System," Romania.
[12] Xiaoyu Wanga, Zhaoguo Wang, Liyuan Xu, Deyun Chen, "Wireless Communications Radiation Monitoring System Based on ZigBee and GPRS," Advanced Materials Research, Vols. 403-408, no. 1662-8985, pp. 1956-1959, 2012.
[13] Camelia Avram, Silviu Folea, Dan Radu and Adina Astilean, "Wireless Radiation Monitoring System," in European Conference on Modelling and Simulation, Romania.
[14] Cheng-Jian, Z., Xian-Hua, L., Xiang-Yong, S., & Qing-Zhou, L., "Analysis on the correlation of atmospheric path radiation and air pollution index," in Urban Remote Sensing Event, 2009.
[15] Baker, C., Davidson, G., Evans, T. M., Hamilton, S., Jarrell, J., & Joubert, W., "High performance radiation transport simulations: preparing for Titan," in International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
[16] Jeong, M. H., Sullivan, C. J., & Wang, S., "Complex radiation sensor network analysis with Big Data analytics," in Nuclear Science Symposium and Medical Imaging Conference, 2015.

¹¹ https://bokeh.pydata.org/
¹² https://www.elastic.co/products/kibana
¹³ http://lucene.apache.org/solr/