Data Mining and Visualization: Meteorological Parameters and Gas Concentration Use Case

© Yas A. Alsultanny
Arabian Gulf University, Manama, Kingdom of Bahrain
alsultanny@hotmail.com

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017

Abstract. Knowledge extraction from big data is an important subject now and will remain so in the future. Mining big data involves many steps, each of which must be implemented carefully. The final step is visualizing the results or summarizing them numerically. This paper aims to mine the big data recorded by an environmental monitoring station. Such stations record the concentrations of several gases together with meteorological parameters. 2D and 3D data visualization is used to evaluate how well visualization can reveal the effect of meteorological parameters on some of the gases that cause pollution. The results show that visualization is a very important tool that can be used in mining big data by showing the concentrations of gases. The paper recommends using big data visualization periodically as an alarm tool for monitoring the concentration levels of polluting gases.

Keywords: meteorological parameters, gas concentration, filtering, preprocessing, decision tree.

1 Introduction

Big Data Mining (BDM) and Data Visualization (DV) are two important topics in the field of knowledge discovery. Big data can be visualized and analyzed to extract knowledge, and visual analytical tools have steadily improved in recent years in order to work with big data.

Big data is collected from different sources, such as stations that monitor polluting gases. These stations usually take hourly readings of the concentrations of gases such as ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), carbon dioxide (CO2), and particulate matter (PM10 and PM2.5). They also take hourly readings of meteorological parameters such as Temperature (Temp), Humidity (Hu), Wind Speed (WS), Wind Direction (WD), and Air Pressure (AP).

Big data is a term used to describe some of the current directions in information technology, as a concept that takes data analysis into consideration. The amount of data in the world is huge, and it grows by about 50% every year [1]. It is important to note that most big data is unstructured: it is not organized and does not fit the usual databases [2]. Big data can be a useful tool for enhancing decision making [3].

Data mining is the technique of getting useful knowledge out of databases; it requires pre-processing and an analytic approach for finding the value. Data mining involves many operations, such as data integration, data selection, and so on [4].

Visual analytics was first defined by Thomas and Cook in 2005 [5] as the science of analytical reasoning facilitated by interactive visual interfaces. Murray in 2013 [6] described data visualization as follows: “fortunately, we humans are intensely visual creatures. Few of us can detect patterns among rows of numbers, but even young children can interpret bar charts, extracting meaning from those numbers’ visual representations. Visualizing data is the fastest way to communicate it to others”.

Air pollution is important in our life; most of the pollutants in the air are a result of emissions from cars, trucks, buses, factories, refineries, and other sources. The objective of this paper is to highlight the aspects of big data mining used to visualize air pollution concentrations and their relation to meteorological parameters.

2 Literature Review

Big data arises with the huge growth of data. It refers to storing, processing, and analyzing vast amounts of data. Big data brings new challenges to visualization because of the speed, size, and diversity of the data. One of the most common definitions of big data is data that has volume, variety, and velocity [7-9]. The term “Big Data” is surrounded by a lot of advertising, and many software vendors claim that their products can handle big data [10]. Innovations in hardware technology, such as those in network bandwidth, memory, and storage, have assisted the technology of big data. These innovations, coupled with the latent need to analyze massive unstructured data, stimulated its development [11].

Data mining is the field of discovering novel and potentially useful information from large amounts of data [12]. It is defined as the use of analytical tools to discover knowledge in a database. The analytical tools may include machine learning, statistics, artificial intelligence, and information visualization [13]. Fayyad et al. in 1996 [14] grouped data mining into categories that include regression, clustering, summarization, dependency modeling, link analysis, and sequence analysis. Knowledge Discovery in Databases (KDD) is the sequence of processing steps used to extract useful information from large collections of data [15]. Data mining mainly has two methods: classification, which assigns the items in a collection to target categories or classes, and clustering, which is a form of unsupervised learning. Classification algorithms include the Reduced Error Pruning (REP) tree, K Nearest Neighbors (KNN), J48, which is based on the C4.5 algorithm, and M5P, which is an improvement of Quinlan’s M5 algorithm [16-20].

“To visualize” has two meanings. “To form a mental image of something” refers to a cognitive, internal aspect, whereas “to make something visible to the eye” refers to an external, perceptual role [21]. Visualization is any kind of technique for presenting information [22-23]. Data visualization refers to any graphic representation that can examine or communicate the data in any discipline [24]. 3D visualization is gradually becoming the main trend in many fields, including pollution gases and meteorological parameters [25].

3 Data Visualization

This study proposes a visualization method that represents air pollution big data graphically, as an efficient method for knowledge discovery. This visual methodology helps people working in the field of air pollution to read the data efficiently and analyze it accurately. Data visualization is the use of computers to represent data visually. It aims to help decision makers look effectively into big data. Data visualization is an efficient and intuitively accessible approach to identifying patterns in large and diverse data sets.

Visualizations of gases and meteorological parameters can have two goals: explanatory and exploratory. Gas and meteorological data are usually recorded by automatic stations at regular time intervals. Meteorological data is typically multivariate and often consists of many dimensions. Air pollution is a major concern in any city throughout the world. The visualization technique is used to aid visual analysis of the air pollution problem, followed by the meteorological data, for knowledge discovery.

Many steps must be taken to prepare data for visualization; these steps are shown in Figure 1. The steps are: station sensor adjustment, data recording, data filtering, data preprocessing, normalization, aggregation, and visualization.

Station Sensors Adjustment → Data Recording → Data Filtering → Data Preprocessing → Normalization → Aggregation → Visualization

Figure 1 Big data acquisition and utilization
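
The filtering, normalization, and aggregation steps named in Figure 1 can be illustrated with a minimal Python sketch. The paper itself used RapidMiner, so everything here is an assumption made for illustration: the sample records, the plausible temperature range, the choice of min-max scaling, the daily averaging, and the function names `filter_rows`, `min_max`, and `daily_mean` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hourly records: (ISO timestamp, temperature in ºC, O3 reading).
# None marks a reading that the station failed to record.
readings = [
    ("2015-07-01T12:00", 44.0, 61.0),
    ("2015-07-01T13:00", 46.0, None),   # missing gas reading
    ("2015-07-01T14:00", 45.0, 70.0),
    ("2015-07-02T12:00", 41.0, 52.0),
    ("2015-07-02T13:00", 999.0, 55.0),  # implausible temperature (sensor fault)
    ("2015-07-02T14:00", 43.0, 58.0),
]

def filter_rows(rows, temp_range=(-10.0, 60.0)):
    """Data filtering: drop rows with a missing gas value or an
    out-of-range temperature (the range is an assumed threshold)."""
    lo, hi = temp_range
    return [r for r in rows if r[2] is not None and lo <= r[1] <= hi]

def min_max(values):
    """Normalization: rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def daily_mean(rows):
    """Aggregation: average the gas reading per calendar day."""
    by_day = defaultdict(list)
    for stamp, _temp, gas in rows:
        by_day[stamp[:10]].append(gas)  # first 10 characters are YYYY-MM-DD
    return {day: mean(vals) for day, vals in sorted(by_day.items())}

clean = filter_rows(readings)
normalized_o3 = min_max([r[2] for r in clean])
daily_o3 = daily_mean(clean)
print(len(clean), normalized_o3, daily_o3)
```

Each step is kept as a separate function so that the order shown in Figure 1 stays visible in the code; a real pipeline would apply the same steps to every gas and parameter column.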

4 Data Collection and Analysis

The data available for this paper were collected from one station in the State of Kuwait, one of the Arabian Gulf countries. The station provided hourly time series data for eleven years; after data filtering and preprocessing, the data for one year, 2015, were analyzed in this paper. The data are hourly averaged readings, so each gas or parameter should have 8,760 readings per year (24 hr × 365 days), but only 8,630 readings remained after filtering and processing, with 130 (1.5%) missing. RapidMiner version 7.5 was used to process and visualize the data.

Figure 2 shows the effect of temperature on the concentration of the five gases (O3, NO2, CO, CO2, and SO2) and PM10. The figure visualizes the data distribution using two-dimensional diagrams. Temperature has opposite effects on O3 and NO2. The concentration of O3 increased directly during the hottest hours, when the temperature was above 40 ºC. Temperature had the reverse effect on NO2: its concentration became lower during the hottest hours and was at its highest levels when the temperature was below 10 ºC. The figure shows that the effect of temperature on CO and CO2 is very limited, which indicates that temperature has almost no effect on these two gases. The hottest hours have a direct effect on SO2 and PM10, whose concentrations usually increased during summer and especially in the hottest hours of the day.

Figure 3 shows the effect of humidity on the five gases and PM10. Humidity has a reverse effect on O3 and NO2: their concentrations increased at lower humidity. The concentrations of CO, CO2, and SO2 also increased at lower humidity percentages. The PM10 concentration was significantly reduced when the humidity was higher than 70%. These results are plausible, because the highest humidity percentages reduce the dispersion of the five gases and PM10.

Figure 4 shows three-dimensional scatter diagrams that visualize the effect of temperature and humidity together on the five gases and PM10. The figure shows again that most of the O3 readings are concentrated in the region of the hottest temperatures and low humidity. The concentrations of NO2 increased at the lowest temperature and humidity. For CO, CO2, SO2, and PM10, the readings are concentrated in the region of the hottest temperatures and low humidity.
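
The reading counts reported for 2015 (8,760 expected, 8,630 available, 130 missing) can be checked with a short Python sketch. The arithmetic comes from the paper; the helper names `expected_hours` and `missing_hours`, and the toy station that lost its first 130 hours, are hypothetical and only show how gaps in an hourly series could be located.

```python
from datetime import datetime, timedelta

def expected_hours(year):
    """Return every hourly timestamp a station should report in a given year."""
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    n_hours = int((end - start).total_seconds() // 3600)
    return [start + timedelta(hours=i) for i in range(n_hours)]

def missing_hours(recorded, year):
    """List the expected timestamps that are absent from the recorded ones."""
    recorded = set(recorded)
    return [h for h in expected_hours(year) if h not in recorded]

# Arithmetic from the paper: 24 hr x 365 days, with 8,630 readings available.
expected = len(expected_hours(2015))
available = 8630
missing = expected - available
pct_missing = 100.0 * missing / expected
print(f"{expected} expected, {missing} missing ({pct_missing:.1f}%)")

# Toy illustration (assumed data): a station that lost its first 130 hours.
recorded = expected_hours(2015)[130:]
gaps = missing_hours(recorded, 2015)
```

Locating the gaps, rather than only counting them, matters for visualization because missing hours that cluster in one season could bias the scatter diagrams.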

Figure 2 Effect of temperature on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)

Figure 3 Effect of humidity on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)

Figure 4 Effect of temperature and humidity on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)

A decision tree is a predictive model [26]. It was implemented in this paper to predict PM10, which is measured in parts per million (ppm), from the effect of temperature and wind speed. To implement the decision tree, the PM10, temperature, and wind speed readings were classified as follows:

PM10 (ppm): 0=0-50, 1=51-150, 2=151-400, 3=401-700, 4=701-1000, 5=1001-1500, 6=1501-2500, 7=2501 and more.
Temperature in degrees Celsius (ºC): 0=0-6, 1=7-11, 2=12-16, 3=17-21, 4=22-26, 5=27-35, 6=36-46, 7=47 and more.
Wind speed in meters per second (m/s): 0=0-2, 1=3-5, 2=6-8, 3=9-12, 4=13 and more.

As an example, the decision rules of the decision tree for predicting PM10 from the temperature and wind speed readings are as follows.

Tree
WS > 3.500: 2 {1=0, 0=0, 2=2, 3=2, 7=0, 4=0, 5=1, 6=0}
WS ≤ 3.500
|   WS > 2.500
|   |   Temp > 4.500: 2 {1=15, 0=0, 2=28, 3=8, 7=0, 4=4, 5=4, 6=5}
|   |   Temp ≤ 4.500: 1 {1=41, 0=11, 2=16, 3=0, 7=2, 4=0, 5=2, 6=3}
|   WS ≤ 2.500: 1 {1=4802, 0=1472, 2=634, 3=77, 7=3, 4=23, 5=16, 6=10}

The tree shows that when the wind speed is between 6 and 12 m/s and the temperature is between 22 and 35 ºC, the PM10 will be between 151 and 400 ppm.
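
The class boundaries and the printed tree can be written out as a small Python sketch. The class ranges and split thresholds are copied from the text. Two points are assumptions of this sketch rather than statements from the paper: the thresholds 2.5, 3.5, and 4.5 are read as applying to the class codes, and readings that fall between two integer bounds (for example 6.5 ºC) are assigned to the higher class. The function names are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bounds of every class except the last, open-ended one,
# copied from the class definitions in the text.
PM10_UPPER = [50, 150, 400, 700, 1000, 1500, 2500]
TEMP_UPPER = [6, 11, 16, 21, 26, 35, 46]
WS_UPPER = [2, 5, 8, 12]

PM10_LABELS = ["0-50", "51-150", "151-400", "401-700",
               "701-1000", "1001-1500", "1501-2500", "2501 and more"]

def to_class(value, upper_bounds):
    """Return the class code: the index of the first upper bound >= value.
    Values below the first range also map to class 0 (an assumption)."""
    return bisect_left(upper_bounds, value)

def predict_pm10_class(ws_class, temp_class):
    """Follow the printed tree; thresholds are compared with class codes."""
    if ws_class > 3.5:
        return 2
    if ws_class > 2.5:
        return 2 if temp_class > 4.5 else 1
    return 1

def predict_pm10_range(wind_speed, temperature):
    """Map raw readings (m/s, ºC) to the predicted PM10 range label."""
    ws_class = to_class(wind_speed, WS_UPPER)
    temp_class = to_class(temperature, TEMP_UPPER)
    return PM10_LABELS[predict_pm10_class(ws_class, temp_class)]

print(predict_pm10_range(10, 30))  # wind class 3, temperature class 5
print(predict_pm10_range(1, 20))   # wind class 0, temperature class 3
```

Under this reading, the only branches that predict the 151-400 range are wind speed class 4, or wind speed class 3 combined with temperature class 5 or higher; every other combination falls into the 51-150 leaf, which holds most of the readings.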

5 Conclusion

The problems of storing and analyzing big data face organizations throughout the world, especially environmental organizations interested in monitoring polluting gases. These organizations have one or more online reading stations installed near industrial cities and oil refineries.

Using 2D and 3D scatter diagrams to visualize the data readings is an important tool. Decision makers can use it to explore the concentration of pollutant gases and the effect of meteorological parameters. With these types of visualization, decision makers can decide to stop or reduce the working hours of the factories or refineries that cause the major pollution.

We recommend that each factory or refinery use the same methods of visualizing polluting gases when deciding whether to stop operating or to reduce working hours when the temperature rises above 45 ºC.

References

[1] Gantz, J., Reinsel, D.: Extracting value from chaos. IDC IVIEW. https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
[2] Lohr, S.: The age of big data. http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
[3] Shumway, R.: One solution for air pollution: big data. http://www.deseretnews.com/article/865617771/One-solution-for-air-pollution-Big-data.html
[4] Han, J., Kamber, M., Jian, P.: Data mining: concepts and techniques. Elsevier Inc. (2012)
[5] Thomas, J., Cook, J.: Illuminating the path: the research and development agenda for visual analytics. National Visualization and Analytics Center (2005)
[6] Murray, S.: Interactive data visualization for the web. O’Reilly Media, Inc. (2013)
[7] http://www.sas.com/en_us/home.html
[8] De Mauro, A., Greco, M., Grimaldi, M.: A formal definition of big data based on its essential features. Journal of Library Review, vol. 65, no. 3, pp. 122–135 (2016)
[9] Dion, M., AbdelMalik, P., Mawudeku, A.: Big data and the Global Public Health Intelligence Network (GPHIN). vol. 41, pp. 209-219 (2015)
[10] Heudecker, N., Beyer, A., Laney, D., Cantara, M., White, A., Edjlali, R., McIntyre, A.: Predicts 2014: big data. Gartner Insight. Gartner Research, Stamford, Connecticut (2013)
[11] Bhagattjee, B.: Emergence and taxonomy of big data as a service. Working Paper CISL# 2014-06. Massachusetts Institute of Technology (2014)
[12] Cheng, S., Liu, Shi, Y., Jin, Y., Li, B.: Evolutionary computation and big data: key challenges and future directions. Proceedings of the First International Conference on Data Mining and Big Data, Bali, Indonesia, pp. 3-14, June 25-30 (2016)
[13] Redpath, R.: A comparative study of visualization techniques for data mining. MSc thesis. School of Computer Science and Software Engineering, Monash University, Australia (2000)
[14] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. American Association for Artificial Intelligence, pp. 37-54 (1996)
[15] Frawley, J., Piatetsky-Shapiro, G., Matheus, J.: Knowledge discovery in databases: an overview; knowledge discovery in databases. AAAI Press/The MIT Press, Menlo Park, California, USA (1991)
[16] Tan, P., Steinbach, M., Kumar, V.: Introduction to data mining. Pearson Addison Wesley (2006)
[17] Witten, I., Frank, E., Hall, M., Pal, C.: Data mining: practical machine learning tools and techniques. Elsevier Inc., 4th Edition (2017)
[18] Kantardzic, M.: Data mining: concepts, models, methods, and algorithms. John Wiley and Sons Inc., 2nd Edition (2011)
[19] Masethe, M., Masethe, H.: Prediction of work integrated learning placement using data mining algorithms. Proceedings of the World Congress on Engineering and Computer Science, San Francisco, USA, vol. I, WCECS 2014, 22-24 October (2014)
[20] Neeb, H., Kurrus, C.: Distributed K-nearest neighbors. https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/neeb_kurrus.pdf
[21] Oxford English Dictionary, Visualization. Oxford University Press (2009)
[22] Chen, C., Hardle, W., Unwin, A.: Handbook of data visualization. Springer (2008)
[23] Keim, A., Mansmann, J., Thomas, S., Ziegler, H.: Visual analytics: scope and challenges. Berlin, Heidelberg, Springer-Verlag (2008)
[24] Few, S.: Now you see it: simple visualization techniques for quantitative analysis. Analytics Press, Oakland (2009)
[25] NESSI: Big data: a new world of opportunities. White Paper (2012)
[26] Rokach, L., Maimon, O.: Data mining with decision trees: theory and applications. World Scientific Publishing (2008)