=Paper= {{Paper |id=Vol-2022/paper53 |storemode=property |title= Data Mining and Visualization: Meteorological Parameters and Gas Concentration Use Case |pdfUrl=https://ceur-ws.org/Vol-2022/paper53.pdf |volume=Vol-2022 |authors=Yas A. Alsultanny |dblpUrl=https://dblp.org/rec/conf/rcdl/Alsultanny17 }} == Data Mining and Visualization: Meteorological Parameters and Gas Concentration Use Case == https://ceur-ws.org/Vol-2022/paper53.pdf
             Data Mining and Visualization:
Meteorological Parameters and Gas Concentration Use Case
                                               © Yas A. Alsultanny
                                           Arabian Gulf University
                                         Manama, Kingdom of Bahrain
                                            alsultanny@hotmail.com
           Abstract. Knowledge extraction from big data is an important subject now and will remain so in the future.
    Mining big data requires many steps, each of which must be implemented carefully. The final step in big
    data mining is visualizing the results or summarizing them numerically. This paper aims to mine the
    big data recorded by an environmental station. Such stations record the concentrations of some gases
    and meteorological parameters. 2D and 3D data visualization is used to evaluate the capability of
    visualization in determining the effect of meteorological parameters on some gases that cause pollution. The
    results show that visualization is a very important tool that can be used in mining big data
    by showing the concentrations of gases. The paper recommends using big data visualization periodically as
    an alarm tool for monitoring the concentration levels of pollutant gases.
           Keywords: meteorological parameters, gas concentration, filtering, preprocessing, decision tree.

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive
Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10-13, 2017

1 Introduction
    Big Data Mining (BDM) and Data Visualization (DV) are two important topics in the field of
knowledge discovery. Big data can be visualized and analyzed to extract knowledge, and visual analytical
tools have steadily improved in recent years in order to work with big data. The data are collected from
different sources, such as stations for monitoring pollutant gases. These stations usually take hourly
readings of the concentrations of gases such as ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2),
carbon monoxide (CO), carbon dioxide (CO2), and particulate matter (PM10 and PM2.5). They also take
hourly readings of meteorological parameters such as Temperature (Temp), Humidity (Hu), Wind Speed (WS),
Wind Direction (WD), and Air Pressure (AP).
    Big data is a term used to describe some current directions in information technology, as a concept
that takes data analysis into consideration. The amount of data in the world is huge, and it grows
annually by 50% of its size [1]. Most big data is unstructured: it is not organized and does not fit
the usual databases [2]. Big data can be used as a tool to enhance decision making [3].
    Data mining is the technique of getting useful knowledge out of databases; it requires pre-processing
and an analytic approach for finding the value. Data mining requires many operations, such as data
integration, data selection, and so on [4].
    Visual analytics was first defined by Thomas and Cook in 2005 [5] as the science of analytical
reasoning facilitated by interactive visual interfaces. Murray in 2013 [6] described data visualization
as follows: “fortunately, we humans are intensely visual creatures. Few of us can detect patterns among
rows of numbers, but even young children can interpret bar charts, extracting meaning from those numbers’
visual representations. Visualizing data is the fastest way to communicate it to others”.
    Air pollution is an important issue in our lives; most of the pollutants in the air are a result of
emissions from cars, trucks, buses, factories, refineries, and other sources. The objective of this paper
is to highlight aspects of big data mining used to visualize air pollution concentrations and their
relation to meteorological parameters.

2 Literature Review
    Big data arises with the huge growth of data. It refers to the storing, processing, and analyzing of
vast amounts of data. Big data brings new challenges to visualization because of the speed, size, and
diversity of the data. One of the most common definitions of big data is data that have volume, variety,
and velocity [7-9]. The term “Big Data” is surrounded by a lot of advertising, with many software vendors
claiming that their products can handle big data [10]. Innovations in hardware technology, such as those
in network bandwidth, memory, and storage, have assisted the technology of big data. These innovations,
coupled with the latent need to analyze massive unstructured data, stimulated its development [11].
    Data mining is the field of discovering novel and potentially useful information from large amounts
of data [12]. It is defined as the use of analytical tools to discover knowledge in a database; these
tools may include machine learning, statistics, artificial intelligence, and information visualization
[13]. Fayyad et al. stated in 1996 [14] that data mining is categorized into seven categories, which
include regression, clustering, summarization, dependency modeling, link analysis, and sequence
analysis. Knowledge Discovery
in Databases (KDD) is the set of processing steps used to extract useful information from large
collections of data [15]. Data mining mainly has two methods: classification, which assigns items in a
collection to target categories or classes, and clustering, which is a form of unsupervised learning.
Classification methods include decision trees such as the Reduced Error Pruning (REP) tree, J48 (based on
the C4.5 algorithm), and M5P (an improvement of Quinlan’s M5 algorithm), as well as K Nearest Neighbors
(KNN) [16-20].
    “To visualize” has two meanings. “To form a mental image of something” refers to a cognitive, internal
aspect, whereas “to make something visible to the eye” refers to an external, perceptual role [21].
Visualization is any kind of technique to present information [22-23]. Data visualization refers to any
graphic representation that can examine or communicate the data in any discipline [24]. 3D visualization
is gradually becoming the main trend in many fields, including pollution gases and meteorological
parameters [25].

3 Data Visualization
    This study proposes a visualization method to represent air pollution big data graphically, as an
efficient method for knowledge discovery. This visual methodology is useful for people working in the
field of air pollution, giving them efficient readability and accuracy of data analysis. Data
visualization is the use of computers for visual representations of data. It aims at helping decision
makers to look effectively into big data. Data visualization is an efficient and intuitively accessible
approach to identifying patterns in large and diverse data sets.
    Visualizations of gases and meteorological parameters can have two goals: explanatory and
exploratory. Gas and meteorological data are usually recorded by automatic stations at regular time
intervals. Meteorological data are typically multivariate and often consist of many dimensions. Air
pollution is a major concern in any city throughout the world. The visualization technique is used to aid
visual analysis of the air pollution problem, followed by meteorological data for knowledge discovery.
    Many steps must be taken in order to prepare data for visualization; these steps are shown in
Figure 1. The steps are: station sensor adjustment, data recording, data filtering, data preprocessing,
normalization, aggregation, and visualization.

                 Stations Sensors Adjustment
                        Data Recording
                         Data Filtering
                     Data Preprocessing
                       Normalization
                        Aggregation
                       Visualization

   Figure 1 Big data acquisition and utilization

4 Data Collection and Analysis
    The data available for this paper were collected in the Arabian Gulf region from one station in the
State of Kuwait. They form an hourly time series covering eleven years; after data filtering and
preprocessing, the data for one year, 2015, were analyzed in this paper. The data are hourly averaged
readings, so the yearly readings for each gas or parameter should number 8,760 (24 hr * 365 days), but
the actual readings after filtering and processing number 8,630, with 130 (1.5%) missing readings.
RapidMiner version 7.5 was used for processing and visualizing the data of this paper.
    Figure 2 shows the effect of temperature on the concentration of the five gases (O3, NO2, CO, CO2,
and SO2) and PM10. The figure visualizes the data distribution using two-dimensional diagrams.
Temperature has opposite effects on O3 and NO2. The concentration of O3 increased directly during the
hottest hours, when the temperature was above 40ºC. Temperature had a reverse effect on NO2: the
concentration of this gas became lower during the hottest hours, and it was at its highest levels when
the temperature was less than 10ºC. The effect of temperature on CO and CO2 is very limited, as is clear
from the figure, indicating that temperature has practically no effect on these two gases. The hottest
hours have a direct effect on SO2 and PM10: their concentrations usually increased during summer,
especially in the hottest hours of the day.
    Figure 3 shows the effect of humidity on the five gases and PM10. Humidity has a reverse effect on O3
and NO2: their concentrations increase at lower humidity. Moreover, the concentrations of CO, CO2, and
SO2 increased with lower percentages of humidity. The PM10 concentration was significantly reduced when
the humidity percentage was higher than 70%. These results are reasonable, because high humidity
percentages reduce the dispersion of the five gases and PM10.
    Figure 4 shows three-dimensional scatter diagrams that visualize the effect of temperature and
humidity together on the five gases and PM10. The figure shows again that most of the O3 readings are
concentrated in the region of the hottest temperatures and low humidity percentages. The concentrations
of NO2 increased at the lowest temperature and humidity. For CO, CO2, SO2, and PM10, the readings are
concentrated in the region of the hottest temperatures and low humidity percentages.
    A decision tree is a predictive model [26]. It was implemented in this paper to predict PM10, which
is measured in parts per million (ppm), from the effect of temperature and wind speed. To implement the
decision tree, the readings were classified into numbered classes. PM10 (ppm) was classified into:
0=0-50, 1=51-150, 2=151-400, 3=401-700, 4=701-1000, 5=1001-1500, 6=1501-2500, 7=2501 and more. The
temperature in degrees centigrade (ºC) was classified into: 0=0-6, 1=7-11, 2=12-16, 3=17-21, 4=22-26,
5=27-35, 6=36-46, 7=47 and more. The wind speed in meters per second (m/s) was classified into: 0=0-2,
1=3-5, 2=6-8, 3=9-12, 4=13 and more. The decision rules of the decision tree that predicts PM10 from
temperature and wind speed readings, as an example, are listed after Figures 2-4.
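
    The class ranges of Section 4 and the splits printed in the tree listing of this section can be
combined into a small lookup. The following Python sketch is illustrative only and is not the RapidMiner
process used in the paper. It assumes that the tree thresholds (3.500, 2.500, 4.500) are splits on class
codes rather than on raw readings, and that a reading falling between two integer ranges (for example
6.4 ºC) goes to the higher class; the function names are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bounds of each class, copied from the ranges in the text.
# The last class ("... and more") is open-ended and needs no bound.
TEMP_UPPER = [6, 11, 16, 21, 26, 35, 46]   # temperature classes 0..7 (deg C)
WS_UPPER = [2, 5, 8, 12]                   # wind speed classes 0..4 (m/s)
PM10_RANGES = ["0-50", "51-150", "151-400", "401-700",
               "701-1000", "1001-1500", "1501-2500", "2501 and more"]


def to_class(value, upper_bounds):
    """Map a raw reading to its class code.

    Assumption: a reading between two integer ranges (e.g. 6.4)
    is assigned to the higher class.
    """
    return bisect_left(upper_bounds, value)


def predict_pm10_class(temp_class, ws_class):
    """Apply the splits of the printed tree to class codes (assumed reading)."""
    if ws_class > 3.5:
        return 2
    if ws_class > 2.5:
        return 2 if temp_class > 4.5 else 1
    return 1


def predict_pm10_range(temp_c, ws_ms):
    """Predict the PM10 range (ppm) from raw temperature and wind speed."""
    pm10_class = predict_pm10_class(to_class(temp_c, TEMP_UPPER),
                                    to_class(ws_ms, WS_UPPER))
    return PM10_RANGES[pm10_class]
```

    For example, predict_pm10_range(40, 10) maps 40 ºC to temperature class 6 and 10 m/s to wind speed
class 3, and returns the range of PM10 class 2.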




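
    The completeness figures quoted in Section 4 (8,760 expected hourly readings, 8,630 recorded, 130
missing) can be checked with a few lines of Python. This is only a worked check of the arithmetic; the
function name is hypothetical.

```python
def completeness(hours_per_day, days, recorded):
    """Return (expected, missing, missing_percent) for an hourly series."""
    expected = hours_per_day * days
    missing = expected - recorded
    return expected, missing, 100.0 * missing / expected


expected, missing, pct = completeness(24, 365, 8630)
print(expected, missing, round(pct, 1))  # 8760 130 1.5
```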
Figure 2 Effect of temperature on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)

Figure 3 Effect of humidity on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)

Figure 4 Effect of temperature and humidity on the five gases and PM10 (panels: O3, NO2, CO, CO2, SO2, PM10)
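
    The paper reads the direct and reverse effects off the scatter diagrams by eye. As a complementary
check, which is not part of the paper's method, the sign of a Pearson correlation coefficient could be
used to label a relationship. The sketch below is a minimal illustration: the 0.3 threshold and the
function names are assumptions, and Pearson correlation captures only linear association.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def effect_label(r, threshold=0.3):
    """Label a coefficient as direct, reverse, or limited.

    The threshold is an arbitrary illustrative choice, not from the paper.
    """
    if r >= threshold:
        return "direct"
    if r <= -threshold:
        return "reverse"
    return "limited"
```

    Applied to paired hourly readings of, say, temperature and one gas, effect_label(pearson(temp, gas))
would give a one-word summary to compare against the visual reading of the figures.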

    The tree listing below shows that when the wind speed is between 6-12 m/s and the temperature is
22-35 ºC, the PM10 will be between 151-400 ppm.

   Tree
WS > 3.500: 2 {1=0, 0=0, 2=2, 3=2, 7=0, 4=0, 5=1, 6=0}
WS ≤ 3.500
   | WS > 2.500
   | | Temp > 4.500: 2 {1=15, 0=0, 2=28, 3=8, 7=0, 4=4, 5=4, 6=5}
   | | Temp ≤ 4.500: 1 {1=41, 0=11, 2=16, 3=0, 7=2, 4=0, 5=2, 6=3}
   | WS ≤ 2.500: 1 {1=4802, 0=1472, 2=634, 3=77, 7=3, 4=23, 5=16, 6=10}

5 Conclusion
    The problems of storing and analyzing big data face organizations throughout the world, especially
environmental organizations interested in monitoring pollutant gases. These organizations have one or
more online reading stations installed near industrial cities and oil refineries.
    Using 2D and 3D scatter diagrams to visualize the data readings is an important tool that can be
used by decision makers to explore the concentration
of pollutant gases and the effect of meteorological parameters. By using these types of visualization,
decision makers can decide to stop or reduce the working hours of the factories or refineries that cause
the major pollution.
    We recommend that each factory or refinery use the same methods of visualizing pollutant gases when
deciding whether to stop operations or reduce working hours when the temperature rises to more than 45ºC.

References
[1] Gantz, J., Reinsel, D.: Extracting value from chaos. IDC IVIEW.
     https://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
[2] Lohr, S.: The age of big data.
     http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
[3] Shumway, R.: One solution for air pollution: big data.
     http://www.deseretnews.com/article/865617771/One-solution-for-air-pollution-Big-data.html
[4] Han, J., Kamber, M., Jian, P.: Data mining: concepts and techniques. Elsevier Inc. (2012)
[5] Thomas, J., Cook, J.: Illuminating the path: the research and development agenda for visual
     analytics. National Visualization and Analytics Center (2005)
[6] Murray, S.: Interactive data visualization for the web. O’Reilly Media, Inc. (2013)
[7] http://www.sas.com/en_us/home.html
[8] De Mauro, A., Greco, M., Grimaldi, M.: A formal definition of big data based on its essential
     features. Library Review, vol. 65, no. 3, pp. 122–135 (2016)
[9] Dion, M., AbdelMalik, P., Mawudeku, A.: Big data and the Global Public Health Intelligence Network
     (GPHIN). vol. 41, pp. 209-219 (2015)
[10] Heudecker, N., Beyer, A., Laney, D., Cantara, M., White, A., Edjlali, R., McIntyre, A.: Predicts
     2014: big data. Gartner insight. Gartner Research, Stamford, Connecticut (2013)
[11] Bhagattjee, B.: Emergence and taxonomy of big data as a service. Working Paper CISL# 2014-06.
     Massachusetts Institute of Technology (2014)
[12] Cheng, S., Liu, Shi, Y., Jin, Y., Li, B.: Evolutionary computation and big data: key challenges and
     future directions. Proceedings of the First International Conference on Data Mining and Big Data,
     Bali, Indonesia, pp. 3-14, June 25-30 (2016)
[13] Redpath, R.: A comparative study of visualization techniques for data mining. MSc thesis. School of
     Computer Science and Software Engineering, Monash University, Australia (2000)
[14] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases.
     American Association for Artificial Intelligence, pp. 37-54 (1996)
[15] Frawley, J., Piatetsky-Shapiro, G., Matheus, J.: Knowledge discovery in databases: an overview;
     knowledge discovery in databases. AAAI Press/The MIT Press, Menlo Park, California, USA (1991)
[16] Tan, P., Steinbach, M., Kumar, V.: Introduction to data mining. Pearson Addison Wesley (2006)
[17] Witten, I., Frank, E., Hall, M., Pal, C.: Data mining: practical machine learning tools and
     techniques. Elsevier Inc., 4th Edition (2017)
[18] Kantardzic, M.: Data mining: concepts, models, methods, and algorithms. John Wiley and Sons Inc.,
     2nd Edition (2011)
[19] Masethe, M., Masethe, H.: Prediction of work integrated learning placement using data mining
     algorithms. Proceedings of the World Congress on Engineering and Computer Science, San Francisco,
     USA, vol I, WCECS 2014, 22-24 October (2014)
[20] Neeb, H., Kurrus, C.: Distributed K-nearest neighbors.
     https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/neeb_kurrus.pdf
[21] Oxford English Dictionary, Visualization. Oxford University Press (2009)
[22] Chen, C., Hardle, W., Unwin, A.: Handbook of data visualization. Springer (2008)
[23] Keim, A., Mansmann, J., Thomas, S., Ziegler, H.: Visual analytics: scope and challenges. Berlin,
     Heidelberg, Springer-Verlag (2008)
[24] Few, S.: Now you see it: simple visualization techniques for quantitative analysis. Analytics
     Press, Oakland (2009)
[25] NESSI: Big data a new world of opportunities. White Paper (2012)
[26] Rokach, L., Maimon, O.: Data mining with decision trees: theory and applications. World Scientific
     Publishing (2008)