ORADIEX: A Big Data driven smart framework for real-time surveillance and analysis of individual exposure to radioactive pollution

Hadi Fadlallah, Lebanese University, Beirut, Lebanon, Hadi.Fadlullah@gmail.com
Yehia Taher, Université de Versailles – Paris-Saclay, Versailles, France, yehia.taher@uvsq.fr
Rafiqul Haque, Intelligencia R&D, Paris, France, Rafiqul.Haque@intelligencia.fr
Ali Jaber, Lebanese University, Beirut, Lebanon, ali.jaber@ul.edu.lb

Abstract - Radiation pollution has always been a critical concern, since it can cause huge damage to humans and to nature. To minimize the damage, governments collect and monitor radiation levels using advanced systems. In the past years, Big Data technologies such as distributed file systems, NoSQL databases and stream processing technologies were implemented in radiation monitoring systems to improve their ability to handle huge volumes of data coming from different sources at high speed. As Big Data technologies are improved frequently to handle the fast growth of data, these systems need to be updated and improved periodically to adopt new technologies and to guarantee a higher control over radiation exposure. In this paper, we propose a system called ORADIEX, which is an improvement of our previously published work RaDEn [2]. It has the ability to (1) read data from sensors and different sources, (2) process data in real-time, (3) store raw radiation data as it comes from sources, (4) clean data and store it in a time-series database, (5) visualize and monitor data in real-time, (6) send an alert when a high radiation level is detected and (7) allow performing advanced data retrieval operations over raw and processed data. In addition, this system was implemented and tested using a real dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS).

Keywords - Radiation, data engineering, Big Data, radiation monitoring, real-time processing

I. INTRODUCTION

Preventing and controlling radioactive exposure is still one of the most critical duties of governments and researchers, since it has a catastrophic effect on every living being [1]. Prevention activities can be classified into three main categories: (1) physical protection, (2) radiation monitoring and (3) handling exposures.

Radiation monitoring is considered the most challenging part, since it requires building intelligent systems that are able to collect, analyze and visualize data and to raise an alert when an exposure is detected.

Due to fast technology growth, radiation data can be collected from a wider variety of data sources such as small wireless sensors, mobile phones and smart watches. In addition, data management technologies are improved frequently to be able to handle the growth of data sources. To handle data coming from multiple data sources in real-time, radiation monitoring systems must adopt the new data technologies and need to be improved periodically.

Several data engineering systems that rely on new data technologies such as NoSQL databases and distributed file systems were proposed in the literature, such as [3][4][5][6][7] and other solutions. These solutions have two main limitations: (1) they cannot handle a huge volume of data in real-time, and (2) fault-tolerance and scalability are not always guaranteed. Earlier, we proposed a radiation engineering system called RaDEn [2], which is built using Big Data technologies such as the Hadoop¹ distributed file system and real-time data ingestion tools. This system solved the problem of reading and storing huge volumes of radiation data, but it still has many limitations: (1) it cannot visualize data from different sources, (2) no notification system was implemented, (3) historical data cannot be visualized since it is saved in raw format, (4) historical alert information is not stored, (5) the real-time graph is very basic and shows only the last 30 measurements, (6) cleaned and processed data was visualized without being saved and (7) it does not have a user-friendly interface.

In this paper, we propose a radiation monitoring system called ORADIEX, where we improved the old system RaDEn [2] by (1) adding a distributed time-series NoSQL database (InfluxDB²) to store data after it is cleaned and processed, and (2) adopting a powerful real-time monitoring framework (Grafana³) that has a user-friendly interface and allows drawing real-time graphs from different sources, designing dashboards, visualizing data already stored, saving historical information about exposures and sending email alerts when a radiation exposure is detected.

The rest of this paper is organized as follows. In Section 2, we briefly introduce our solution called ORADIEX. The data processing is described in Section 3. The development of ORADIEX is detailed in Section 4. Section 5 demonstrates ORADIEX. We conclude our work in Section 6.

¹ http://hadoop.apache.org
² http://influxdata.com
³ http://grafana.com

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

II. AN OVERVIEW OF ORADIEX

ORADIEX is a data engineering platform that has the ability to read huge amounts of data from multiple sources with different formats using a scalable and distributed message broker, clean and process the collected data, store the data within a scalable storage, visualize data in real-time graphs and raise an alert when a high radiation level is detected.

ORADIEX stores data in its raw format within a scalable and fault-tolerant data lake to ensure data governance, and stores processed and cleansed data in JSON format within a scalable NoSQL time-series database that allows users to perform data retrieval operations.

ORADIEX can handle data captured by sensors in real-time; it can also handle data from other sources such as databases or flat files.

The ORADIEX architecture is composed of 6 layers, as shown in Figure-1:

Figure 1 - ORADIEX architecture

• Data Sources: Data sources consist of data generated from sensors, or data stored within flat files or relational databases.
• Data Ingestion: This layer consists of a scalable message broker and other ingestion tools that allow reading data generated from the different sources and sending it at the same time to the data processing and raw data storage layers.
• Raw Data Storage: This layer consists of a scalable data lake built on top of a distributed file system, where data is stored as it comes from the sources without editing. Besides the data lake, metadata is stored in a metadata repository that allows users to perform data retrieval operations.
• Data Processing: This layer relies mainly on a distributed data processing framework that allows processing huge volumes of data. In this layer, data is cleansed and transformed into JSON format to be stored within the processed data storage layer.
• Processed Data Storage: This layer consists of a distributed and scalable data warehouse that has the ability to store time-series data having different attributes.
• Data Visualization: This layer allows users to create dashboards that visualize, in real-time, data newly inserted into the processed data storage layer. It also gives the ability to perform data retrieval operations and to send email notifications when a radiation exposure is detected.

III. DATA PROCESSING

Since data comes from different data sources, the quality of the incoming data must be assessed and improved. In the data processing layer, we implemented a simple data cleansing and quality assurance process using the following steps:

1. The measurement date is validated; if it is not a valid date time value the data is rejected, else it is converted to a universal date time format (yyyy-MM-ddTHH:mm:ss).
2. The other measurements are validated; if any measurement cannot be parsed to a numeric value it is removed.
3. All empty strings are replaced with NULLs.
4. Data is converted to a standard format (JSON⁴), to be stored in the processed data storage layer.

⁴ http://json.org

IV. DEVELOPMENT OF ORADIEX

Each layer of ORADIEX was deployed on a separate virtual machine, with Ubuntu⁵ 16.04 LTS as the operating system.

For data ingestion, we used one virtual machine where we installed and configured an Apache Kafka⁶ broker to read data from different sources. Besides the message broker, we installed an Apache Flume⁷ agent to send data to the data lake directly when it is received by the message broker. In addition, we installed Apache Sqoop⁸ on the ingestion layer, to give the user the ability to import archival data from relational databases directly into the data lake. We chose these technologies since they all guarantee high scalability and fault-tolerance.

For the raw data storage, we built a 4-node Hadoop 3.1.0 cluster where we configured one virtual machine (node) to act as a master node and the others as slaves (data nodes). We set the replication factor to 3, so when data is sent to the Hadoop master node it is replicated on all 3 data nodes. The Hadoop Distributed File System (HDFS) guarantees a high level of scalability and fault-tolerance. In addition, we installed Apache Hive⁹ to store the metadata, so it can be used to perform data retrieval operations over the raw data using an SQL-like language.

For the data processing, we deployed a single-node Apache Spark cluster on a separate virtual machine. Apache Spark is a distributed, scalable and fault-tolerant processing framework that has the ability to process data at rest and in real-time.

To implement the processing logic, we coded a Python¹⁰ script that uses the PySpark library, which is a wrapper of Spark. The script listens to the message broker and reads newly added data, then it filters the bad rows as described in the data cleansing section. After ensuring the data quality, the data is transformed to JSON format to be stored in the processed data storage layer.

To store the processed data, we used a time-series NoSQL database called InfluxDB, which guarantees high scalability. The reason for using a time-series database is that the main key in the data we are working on is the date and time of measurement.

The data is stored in JSON format within InfluxDB. Each JSON value is composed of 4 parts, as shown in Figure-2:

Figure 2 - Processed data JSON structure

• Measurement: The name of the table where data is stored
• Time: The date and time of the measurement
• Fields: A list of values that can be visualized (rain level, radiation level, …)
• Tags: A list of values that can be used to filter data (station name)

For visualization, we installed a tool called Grafana, used for real-time monitoring. It allows designing dashboards to visualize data, and querying the data stored within InfluxDB. In addition, it allows defining a radiation level limit for each graph (we can define one for each station, since the radiation level is affected by weather and temperature factors, which differ between locations) and sending an email alert when this limit is reached. Grafana was installed on the same machine as InfluxDB to guarantee real-time visualization.

⁵ http://ubuntu.com
⁶ http://kafka.apache.org
⁷ http://flume.apache.org
⁸ http://sqoop.apache.org
⁹ http://hive.apache.org
¹⁰ http://python.org

V. DEMONSTRATION OF ORADIEX

In this section, we demonstrate ORADIEX. For our demonstration we used a radiation dataset supplied by the department of environmental radiation control at the Lebanese Atomic Energy Commission (LAEC-CNRS).

A. Dataset

We could not access the sensors or the web server (relational database) due to confidentiality issues. The dataset was provided as flat files with data collected from 2015-08-01 to 2016-08-01 from a testing sensor that was installed in Beirut. The data set structure is described in Table-1.

Column name              Data Type   Unit    Description
Measurement_time         Datetime    -       Measurement date and time
dose_rate                Numeric     nSv/h   The radiation dose rate
Temperature              Numeric     °C      Temperature
Rain_Level               Numeric     mm/h    The rain level
Sensor_battery_power     Numeric     mV      The sensor internal battery power
External_battery_power   Numeric     mV      The sensor external battery power
Station_Name             Text        -       The sensor station name

Table 1 - Data set structure

B. Starting and Configuring ORADIEX

First of all, we started all virtual machines (Ingestion, Processing, Raw Storage, Processed data storage). We started the following services:

• Apache Kafka and Apache Flume services on the Ingestion machine.
• Hadoop Cluster (Name node and data nodes) on the Raw data storage machines.
• Apache Spark Cluster on the data processing machine.
• The Python script on the data processing machine.
• InfluxDB and Grafana services on the monitoring machine.

To simulate data ingestion from sensors, we created a directory into which the provided data set must be copied, and we created a terminal script that creates a listener on this folder. When any file is inserted, the script loops over the lines and sends them one by one to the Apache Kafka producer.

Using Grafana, we created a dashboard that contains one graph that visualizes the radiation dose rate, the rain level and the temperature data received from the Beirut station only, and we set the radiation limit to 45, as shown in Figure-3.

Figure 3 - Configuring Alert

Moreover, we configured the email notification settings, where many recipients can be added and a custom message can be written, as shown in Figure-4.

Figure 4 - Configuring Email notification

C. Radiation Monitoring

After starting and configuring ORADIEX, we copied the data set to the ingestion directory. The data was visualized in real-time on the dashboard we created (Figure-5).

Figure 5 - Monitoring Dashboard

In addition, a notification email was received when the radiation level limit was exceeded; at the same time, all alerts were recorded in the dashboard alert list, as shown in Figure-6.

Figure 6 - Alert list

D. Retrieving Data

As shown in Figure-7, we can perform data retrieval operations on the InfluxDB database using the Grafana interface, and the result is visualized as a graph.

Figure 7 - Data Retrieval from InfluxDB using Grafana

VI. CONCLUSION

In this paper, we designed a solution called ORADIEX, which is an improved version of our previous work RaDEn [2]. In this version, we added a NoSQL database that stores processed data as a time series, and we replaced the old visualization tool (the Matplotlib Python library) with a powerful real-time monitoring tool called Grafana that has a user-friendly interface and allows real-time monitoring, data retrieval and sending notifications when a radiation exposure occurs.

We tried to cover all the limitations we identified in RaDEn at the beginning of our work, but unfortunately the implementation we made has some limitations due to the following issues: (1) we did not get permission to access the sensors or the databases, and (2) the research time limit.

Several pieces of work are lined up for the future. We can enrich the data by integrating the free weather data offered by online APIs. Also, we can benefit from search engines such as Solr¹¹ and Elasticsearch¹² to perform data retrieval operations on raw data.

¹¹ http://lucene.apache.org/solr
¹² http://elastic.io

REFERENCES

[1] "Ionising Radiation and Human Health," Australian Government - Department of Health, 07 December 2012. [Online]. Available: http://www.health.gov.au/internet/publications/publishing.nsf/Content/ohp-radiological-toc~ohp-radiological-05-ionising. [Accessed 17 September 2018].
[2] Fadlallah H., Taher Y., Jaber A., "RaDEn: A Scalable and Efficient Radiation Data Engineering," in International conference of Big Data and Cyber Security Intelligence, 2018.
[3] Avram C., Folea S., Dan Radu & Astilean A., "Wireless Radiation Monitoring System," in European Conference on Modelling and Simulation, Romania.
[4] Baker, C., Davidson, G., Evans, T. M., Hamilton, S., Jarrell, J., & Joubert, W., "High performance radiation transport simulations: preparing for Titan," in International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
[5] Jeong, M. H., Sullivan, C. J., & Wang, S., "Complex radiation sensor network analysis with big data analytics," in Nuclear Science Symposium and Medical Imaging Conference, 2015.
[6] Liao, T. S., Wu, C. C., Chou, C. C., Hwang, C. H., Tang, Y. W., Tsai, D. P., & Chen, T. Y., "Simplified algorithm of ionizing radiation detecting based on image sensor," in Instrumentation and Measurement Technology Conference Proceedings, 2016.
[7] Kim, K. S., Kojima, I., Suzuki, R., Naito, W., & Ogawa, H., "RALFIE: a life-logging system for reducing potential radiation exposures," in the 1st ACM SIGSPATIAL International Workshop on the Use of GIS in Emergency Management, 2015.
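
As an illustration of the four cleansing steps listed in the data processing section, the following is a minimal sketch in plain Python rather than the authors' PySpark script, which the paper does not reproduce. The column names come from Table 1; the accepted input date formats (`ASSUMED_DATE_FORMATS`) and the function names are assumptions made for the example. Step 2 is read here as dropping the individual non-numeric measurement rather than the whole row.

```python
import json
from datetime import datetime

# Assumed input formats: the paper does not say how the sources encode dates.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S", "%d/%m/%Y %H:%M")
# Numeric columns as listed in Table 1.
NUMERIC_COLUMNS = ("dose_rate", "Temperature", "Rain_Level",
                   "Sensor_battery_power", "External_battery_power")


def parse_time(text):
    """Return the timestamp as yyyy-MM-ddTHH:mm:ss, or None if it is not a valid date."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue
    return None


def clean_row(row):
    """Apply steps 1-4 to one record (a dict of strings); return JSON text or None."""
    timestamp = parse_time(row.get("Measurement_time") or "")
    if timestamp is None:
        return None  # step 1: reject records without a valid date time value
    cleaned = {"Measurement_time": timestamp}
    for column in NUMERIC_COLUMNS:
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            cleaned[column] = None  # step 3: empty string becomes NULL
            continue
        try:
            cleaned[column] = float(raw)
        except ValueError:
            pass  # step 2: a measurement that is not numeric is removed
    station = (row.get("Station_Name") or "").strip()
    cleaned["Station_Name"] = station or None  # step 3 applied to the text column
    return json.dumps(cleaned)  # step 4: standard JSON output


example = clean_row({"Measurement_time": "2015-08-01 00:10:00",
                     "dose_rate": "41.5", "Temperature": "abc",
                     "Rain_Level": "", "Station_Name": "Beirut"})
```

In a Spark job the same function could be applied per record, but how the authors wired it into PySpark is not described in the paper.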
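
The four-part structure of a processed record (measurement, time, fields, tags) described for Figure 2 can be sketched as a small Python helper. This is an assumption-laden illustration: the measurement name `radiation` is hypothetical, and the split of Table 1 columns into fields and tags follows the examples the paper gives (station name as a tag, numeric readings as fields).

```python
# Numeric readings from Table 1 are treated as fields (values that can be plotted).
FIELD_COLUMNS = ("dose_rate", "Temperature", "Rain_Level",
                 "Sensor_battery_power", "External_battery_power")
# The station name is treated as a tag (a value used to filter data).
TAG_COLUMNS = ("Station_Name",)


def to_point(record, measurement="radiation"):
    """Build one measurement/time/fields/tags dictionary from a cleaned record.

    `measurement` defaults to a hypothetical table name; NULL values are left out
    so that only readings that are actually present become fields or tags.
    """
    fields = {name: record[name] for name in FIELD_COLUMNS
              if record.get(name) is not None}
    tags = {name: record[name] for name in TAG_COLUMNS if record.get(name)}
    return {
        "measurement": measurement,
        "time": record["Measurement_time"],
        "fields": fields,
        "tags": tags,
    }


point = to_point({"Measurement_time": "2015-08-01T00:10:00",
                  "dose_rate": 41.5, "Temperature": None,
                  "Station_Name": "Beirut"})
```

Client libraries such as the `influxdb` Python package accept a list of dictionaries of this shape in `write_points`; whether the authors used that client is not stated in the paper.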
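
The ingestion simulation described in the demonstration (a watched directory whose files are sent line by line to the Kafka producer) might look roughly like the sketch below. The authors used a terminal script that is not shown in the paper; this Python version is a stand-in that polls the directory instead of registering a listener, and it takes the sending function as a parameter so that it does not depend on a running broker. All function names are hypothetical.

```python
from pathlib import Path


def replay_file(path, send):
    """Send each non-empty line of a data file to `send`; return the number sent."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                send(line.encode("utf-8"))  # one message per measurement line
                count += 1
    return count


def replay_new_files(directory, send, seen):
    """Replay every file in `directory` whose name is not yet in the set `seen`."""
    sent = 0
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name not in seen:
            sent += replay_file(path, send)
            seen.add(path.name)  # remember the file so it is not replayed twice
    return sent


# Possible wiring with the kafka-python package (broker address and topic name
# are assumptions, not taken from the paper):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   replay_new_files("ingest", lambda payload: producer.send("radiation-raw", payload), set())
#   producer.flush()
```

Calling `replay_new_files` in a loop with a short sleep would approximate the folder listener; passing `send` in keeps the file handling testable without Kafka.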