The Development of the Information System for Anomality
Detection in the Utility Meters Data Using Self-Organized Maps

Ivan Azarov a, Roman Voronkin a, Ilya Chaika a, Alena Lyurova a, Michail Kotlov a
a North Caucasus Federal University, 2 Kulakova str, Stavropol, 355029, Russia


                 Abstract
                 This article presents a project for modernizing the data analysis technology of the system that
                 accounts for the consumption of utility resources. The system needs improvement because it is
                 not efficient enough at identifying unaccounted consumption of utility resources. The data
                 analysis processes are automated by means of an artificial neural network. A feedforward
                 network based on a multilayer perceptron with one hidden layer was chosen as the model, and
                 the backpropagation algorithm was chosen as the method for training it.

                 Keywords
                 Neural networks, self-organized map (SOM), information system, energy efficiency, energy saving,
                 commercial accounting

1. Introduction
    As global energy consumption grows at an accelerating pace while the reserves of non-renewable energy
resources on Earth shrink, energy efficiency and energy conservation have become some of the most urgent
problems. Accordingly, maximum energy efficiency and energy saving are now priority goals of both national
and global energy policy [1].
    Manual analysis of large volumes of data on the consumption of utility resources by the population of a
monitored area leads to significant losses for both consumers and resource-supplying organizations.
    In this regard, in the field of housing and communal services, the issue of organizing reliable and timely
detection of energy losses has arisen. One of the ways to solve this problem is to analyze the readings of
metering devices.
    The management company's analysis of the meter readings provided by the consumer consists of:
    1. checking the status of individual and general metering devices, including whether they are present;
    2. verifying the meter readings provided by the consumer against the readings of the corresponding
meter at the time of the check;
    3. providing reports on detected unauthorized or unaccounted consumption of utility resources to the
department of housing and communal services management.
    The main purpose of this analysis is to identify unauthorized or unaccounted consumption of utility
resources in order to tighten control over the consumption of utility resources in apartment buildings.


YRID-2020: International Workshop on Data Mining and Knowledge Engineering, October 15-16, 2020, Stavropol, Russia
EMAIL: azarov8282@mail.ru (Ivan Azarov); roman.voronkin@gmail.com (Roman Voronkin); igull98@mail.ru (Ilya Chaika);
a8923719125@yandex.ru (Alena Lyurova); mikhailits161@yandex.ru (Michail Kotlov)
ORCID: 0000-0002-6810-8152 (Ivan Azarov); 0000-0002-7345-579X (Roman Voronkin); 0000-0003-4448-8901 (Ilya Chaika);
0000-0003-4005-4399 (Alena Lyurova); 0000-0001-5114-1012 (Michail Kotlov)
© 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Formulation of the problem
   The current system for analyzing the consumption of utility resources is imperfect and has a number of
problems, one of which is determining the time and location of a leak.
   The analysis of meter readings is based on checking how far the total of the apartment meter readings
deviates from the general house meter reading. Figure 1 shows a context diagram of the business process of
analyzing the commercial accounting of utility resources.
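
   The deviation check described above can be illustrated with a minimal sketch. The function names and the
5% tolerance are hypothetical and are not taken from the described system; the sketch only shows the
comparison of the general house meter with the sum of the apartment meters for one billing period.

```python
def house_imbalance(house_reading, apartment_readings):
    """Absolute and relative gap between the general house meter and the
    sum of the apartment meters for one billing period."""
    total_apartments = sum(apartment_readings)
    absolute = house_reading - total_apartments
    relative = absolute / house_reading if house_reading else 0.0
    return absolute, relative


def is_suspicious(house_reading, apartment_readings, tolerance=0.05):
    """Flag the building when the unaccounted share exceeds the tolerance.
    The 5% default is an assumed value, not one given in the paper."""
    _, relative = house_imbalance(house_reading, apartment_readings)
    return relative > tolerance


# Example: 100 m^3 on the house meter, 90 m^3 reported by the apartments.
print(house_imbalance(100.0, [30.0, 25.0, 35.0]))  # (10.0, 0.1)
print(is_suspicious(100.0, [30.0, 25.0, 35.0]))    # True
```

   A check of this kind reveals that a discrepancy exists in a building, but not where or when it arose,
which is the limitation noted above.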


Figure 1: Context diagram

   Let us decompose the commercial accounting process and use the decomposition to consider the main tasks.


Figure 2: Detailing the context diagram (IDEF0 - AS-IS)

   The current technology for analyzing meter readings has a number of disadvantages:
    • the "output indicators" available from the resource supplier are compared with the "input
indicators" available to the consumer manually and only very rarely, so there is no reliable analysis of
network losses or of the need to modernize the networks;
    • time is spent on preparing and searching for the necessary data;
    • manual processing of information leads to numerous errors;
    • there is no unified presentation of information;
    • information processing is highly complex;
    • the collection and registration of initial information is poorly organized;
    • the volume of paper workflow is large;
    • emergency situations are not identified immediately, and the bills for excess costs fall on the
management companies;
    • there is insufficient data to analyze network losses and assess the need for network modernization.
    All of these shortcomings lead to the loss of information about the utility resources consumed.
Suppliers are paid for the entire amount of resources provided, whereas consumers pay only for the amount
they have consumed. As a result, utility tariffs for the population are overstated.
    To improve the organization of business processes, the task was set to automate the analysis of
commercial accounting data on the consumption of utility resources. This task includes the following points:
    • modernize the processing and analysis of meter readings. This belongs to the class of "data
analysis" tasks. At present, data analysis is not fully carried out in this area: all discrepancies
identified between the readings of the general house meter and the total of the individual meters are
distributed equally among all consumers of a given building;
    • create a mechanism for identifying leakage or theft of resources.
   The created subsystem must meet the following requirements:
    • analysis of the amount of consumed resources;
    • identification of leakage and theft of energy resources;
    • generation of reports.

3. Method
    The entire technological process can be subdivided into collecting and entering the initial data into the
computing system, placing and storing the data in the system memory, processing the data to obtain results,
and issuing the data in a form that is convenient for the user. Data collection and recording operations are
carried out using various means.
    The subsystem for analyzing the consumption of utility resources collects information automatically. In
the developed software and hardware complex for energy accounting, data are submitted to the system
automatically: as soon as new information appears in the system that accounts for the consumption of utility
resources, it is passed to the neural network for analysis.
    The analysis subsystem will include tools for creating, training, saving and loading an artificial neural
network. With the help of this subsystem, it will be possible to create a multilayer artificial neural network
and train it by the error backpropagation method.
    A simple analysis algorithm calculates an average value and looks for deviations from it. In this case,
however, the algorithm would have to be changed for different cities, seasons and other conditions, or it
would have to provide for all possible scenarios, which is not always feasible. To achieve greater
flexibility, classification and clustering are used; they make it possible to fully perform the required
information processing for subsequent analysis by a specialist [18].
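
    For comparison, the simple average-based approach mentioned above might look like the following sketch.
The function name and the multiplier k are assumptions made for illustration; the fixed threshold is
exactly the part that would have to be retuned for each city or season.

```python
from statistics import mean, pstdev


def mean_deviation_outliers(readings, k=2.0):
    """Indices of readings that deviate from the mean by more than k
    standard deviations (k is an assumed, hand-tuned parameter)."""
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        return []  # all readings identical, nothing stands out
    return [i for i, x in enumerate(readings) if abs(x - mu) > k * sigma]


# Hypothetical daily consumption values with one obvious spike.
daily = [10.2, 9.8, 10.1, 10.0, 9.9, 25.0, 10.3]
print(mean_deviation_outliers(daily))  # [5]
```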
    Data clustering algorithms are well suited to this problem, since the task itself boils down to
searching for anomalies. If all the data coming from metering devices are divided into clusters, then the
data that do not fall into any cluster are considered anomalous.
    Let us consider various machine-learning methods for clustering information, both based on existing
precedents and with the help of specialists. Clustering algorithms based on structural, metric and
probabilistic approaches are also considered.
    The k-means method is used to select groups of objects in economics, data analysis and information
retrieval systems. It clusters data by dividing a vector space into a predetermined number of clusters k.
The advantages of the algorithm are its speed and ease of implementation. Its disadvantages are the
uncertainty in the choice of the initial cluster centers and a relatively long runtime when applied to large
databases.
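
    A minimal sketch of the k-means procedure, using only the Python standard library, is given below. The
random choice of initial centers (controlled here by a hypothetical seed parameter) is one possible
initialization and illustrates the uncertainty mentioned above; it is not the implementation used in the
described system.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initial centers are k distinct samples picked at random (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Two well-separated groups of hypothetical two-dimensional readings.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = kmeans(pts, k=2)
print(labels)
```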


    The PAM (partitioning around medoids) algorithm is similar to the k-means algorithm, except that the
objects are redistributed relative to the medoid of the cluster rather than its center [19]. The main
disadvantage of this algorithm is the limitation on the amount of data.
    The hierarchical clustering method is used in collecting statistical data and is implemented in
statistical packages. It is also used for clustering text documents. The CURE (Clustering Using
REpresentatives) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithms are used
to cluster very large sets of numeric data, but they require threshold values to be set, and the latter can
identify only spherical clusters.
    The fuzzy C-means algorithm allows large sets of numerical data to be classified. It can be considered an
improved version of k-means in which, for each element of the set under consideration, the degree of its
membership in each of the clusters is calculated. The method has limited application because of a
significant drawback: it cannot partition the data correctly when the clusters have different variances in
different dimensions (axes). It also has high computational complexity.
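
    The degree of membership mentioned above can be sketched as follows, using the standard fuzzy C-means
membership formula with Euclidean distances. The function name and the fuzzifier value m = 2 are
assumptions chosen for illustration.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Degrees of membership of one point in each cluster; they sum to 1.
    m > 1 is the fuzzifier (m = 2 is an assumed, commonly used value)."""
    distances = [math.dist(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that cluster.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances)
            for d_i in distances]


# A point three times closer to the first center than to the second.
print(fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)]))  # roughly [0.9, 0.1]
```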
    In [17], it is shown that the Kohonen network outperforms the k-means method, with high accuracy and
minimal computation time for the same data set and parameters.
    To solve the problem of searching for anomalies, the Kohonen neural network, which is trained without a
teacher (unsupervised learning), was chosen. Its main advantage is that there is no need to keep all
processed data in the computer's RAM. The second important advantage is high resistance to noisy data: small
deviations that blur the clusters are not treated as anomalies.
    This type of neural network operates on a winner-take-all basis.
    The most interesting property of Kohonen networks is self-organization, namely, the ability to reproduce
the arrangement of objects in an N-dimensional space. "Regular" one-dimensional Kohonen networks are used for
data clustering, while multi-dimensional Kohonen networks can be used for image recognition.
    Kohonen network training involves a number of choices, such as the learning rate function, the algorithm
for initializing neuron weights and the optimization method, which significantly affect the training result.
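
    The winner-take-all principle and the training choices listed above can be illustrated with a minimal
sketch of a one-dimensional Kohonen map. The linear learning rate decay, the Gaussian neighborhood and the
initialization from random training samples are assumptions made for this example; the paper does not state
which variants were used.

```python
import math
import random


def winner(weights, x):
    """Winner-take-all: index of the neuron whose weights are closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(data, n_neurons, epochs=100, lr0=0.5, seed=0):
    """Train a one-dimensional Kohonen map and return its weight vectors."""
    rng = random.Random(seed)
    # Weight initialization (assumed): copy randomly chosen training samples.
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    radius0 = max(n_neurons / 2.0, 1.0)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                       # learning rate function (assumed linear)
        radius = max(radius0 * decay, 0.5)     # neighborhood width shrinks over time
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            best = winner(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighborhood: the winner moves most, its neighbors less.
                h = math.exp(-((j - best) ** 2) / (2.0 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical normalized readings forming two groups.
readings = [(0.10, 0.12), (0.11, 0.10), (0.09, 0.11),
            (0.80, 0.82), (0.79, 0.81), (0.81, 0.80)]
som = train_som(readings, n_neurons=2)
print([winner(som, x) for x in readings])
```

    Objects that share the same winning neuron form one cluster. Because the weights are updated one sample
at a time, the whole data set does not have to be processed in a single batch.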


Figure 3: Diagram of the Kohonen network

    Let us describe the anomaly detection algorithms used in this work. These methods are based on studying
the clusters of objects obtained after training. Objects that share the same winning neuron form a cluster.
The algorithm is based on estimating the distance from the center of a cluster to the objects of this
cluster.
    Suppose that training has produced a trained network and a partition of the training sample into
clusters. Several preliminary calculations are done to identify anomalies among the full set of objects.
Consider one of the clusters obtained from the training sample. Let N denote the number of vectors in the
cluster, and let V = {v^i = ⟨v_1^i, …, v_K^i⟩ | i = 1, …, N} be the set of vectors from this cluster. The
vectors must be normalized. The center of the cluster is the mean of the coordinates of the vectors of this
cluster, namely:
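        $c = (c_1, \dots, c_K)$, where $c_j = \frac{1}{N} \sum_{i=1}^{N} v_j^i$.        (1)
   Let us calculate the cluster radius:
        $r = \frac{1}{N} \sum_{i=1}^{N} d(v^i, c)$, where $d(\cdot, \cdot)$ is the Euclidean distance.
   The next step is to calculate the rms deviation in the training set:
        $\sigma = (\sigma_1, \dots, \sigma_K)$, where $\sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (v_j^i - c_j)^2}$.        (2)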
   This concludes the preliminary calculations. The values c, r and σ are saved to the database and can
be used later. To improve the accuracy of anomaly detection, formulas (1) and (2) are recalculated
for the full set of objects. All vectors supplied to the input of the algorithm must be normalized.
Anomalies are detected as follows: if a normalized vector v from the complete set of objects satisfies
the condition
        $d(v, c) > r + 2\|\sigma\|$,        (3)
then the object corresponding to the vector v is an anomaly.
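
   The preliminary calculations and the anomaly condition can be sketched as follows. The function names
are hypothetical, the radius is taken to be the mean Euclidean distance from the cluster members to the
center (an assumption about the radius formula), and the input vectors are assumed to be already normalized.

```python
import math


def cluster_stats(vectors):
    """Center c, radius r and per-coordinate rms deviation sigma of one cluster.
    The radius is assumed to be the mean distance of the members to the center."""
    n, k = len(vectors), len(vectors[0])
    c = [sum(v[j] for v in vectors) / n for j in range(k)]
    r = sum(math.dist(v, c) for v in vectors) / n
    sigma = [math.sqrt(sum((v[j] - c[j]) ** 2 for v in vectors) / n)
             for j in range(k)]
    return c, r, sigma


def is_anomaly(v, c, r, sigma):
    """Anomaly condition: d(v, c) > r + 2 * ||sigma||."""
    sigma_norm = math.sqrt(sum(s * s for s in sigma))
    return math.dist(v, c) > r + 2.0 * sigma_norm


# Hypothetical normalized vectors of one cluster.
cluster = [(0.4, 0.4), (0.6, 0.4), (0.4, 0.6), (0.6, 0.6)]
c, r, sigma = cluster_stats(cluster)
print(is_anomaly((0.55, 0.5), c, r, sigma))  # False: close to the center
print(is_anomaly((1.0, 1.0), c, r, sigma))   # True: far outside the cluster
```

   In this example the threshold is about 0.42, so a vector at distance 0.05 from the center is accepted,
while a vector at distance of about 0.71 is reported as an anomaly.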
   The number of clusters into which to split the input sample depends on the network
hyperparameters. The issue of choosing the number of clusters requires additional study before
implementing a neural network.
   Kohonen's network will recognize clusters in the training data and assign all data to one cluster
or another. If, after that, the network encounters a dataset that is not similar to any of the known
samples, then it will not be able to classify such a dataset and thus reveal its anomalousness.

4. Design results
   Figure 4 shows a diagram of the decomposition of the main "TO-BE" business process of the technology for
analyzing meter readings.


Figure 4: Context diagram

   With an information subsystem for analyzing the consumption of utility resources in place, the controller
no longer needs to visit apartments to verify the readings. In addition, the subsystem highlights not only
the fact of a leak and the amount of energy resources involved, but also the place where the leak occurred.
   The functioning of an information system is a purposeful transformation of input information into output
information. Information in the system passes through several software components. Figure 5 shows all the
developed modules and the relationships between them.


Figure 5: Diagram of the information subsystem functioning

    The presented block diagram of the package includes all the developed modules and reflects the
relationships between them.
    The sources of operational and conditionally permanent information are the systems that record the
readings of water, gas and electricity meters. The primary information consists of calculated indicators of
the volumes of supplied utilities.
    The resulting document is a report on the revealed facts of leakage or theft of resources.
    Information input and the display of result data are performed automatically. Reports are generated
through forms, which are the user's main interactive means of working with the program.


5. Conclusion
    The first advantage of the developed system is its flexibility, provided by the use of a neural network as
the analysis system. The system is suitable for use in different conditions and can take into account the
specifics of a given city or settlement. In addition, it is more adaptive than the classical algorithms for
analyzing utility meter readings and is scalable. However, the advantages of the project also lead to its
disadvantages: the use of a neural network presupposes at least basic knowledge of the principles of its
operation, is relatively more complex, cannot be used immediately after implementation, and requires time and
a relatively large amount of data for training. ... The application of the developed information system is
associated with the following economic advantages:
     • reduction of costs for manual processing of meter readings;
     • reduction of personnel labor costs;
     • availability of automated functionality to identify emergencies, leaks or theft of resources;
     • availability of a system for automatic forecasting of resource consumption.

6. References

   [1] Kalitin D.V. Artificial neural networks [Electronic resource]: tutorial / Kalitin DV - Electron. Text
       data. — Moscow: Misis Publishing House, 2018. — 88 p.


[2] Sedov V.A. Introduction to neural networks [Electronic resource]: guidelines for laboratory work in
     the discipline "Neuroinformatics" for students of the specialty 09.03.02 "Information systems and
     technologies" / Sedov VA, Sedova NA - Electron. Text data. — Saratov: IP Er Media, 2018. — 30 p.
[3] Citizen E.I. Neural networks [Electronic resource]: textbook / EI citizen - Electron. Text data. —
     Samara: Volga State University of Telecommunications and Informatics, 2017. — 84 p.
[4] Yakhyaeva G.E. Fuzzy sets and neural networks [Electronic resource]: tutorial / Yakhyaeva GE -
     Electron. Text data.— Moscow: Internet University of Information Technologies (INTUIT), IPR
     Media, 2020.— 315 p.
[5] Adrian Iustin Georgevici & Marius Terblanche Neural networks and deep learning: a brief
     introduction 06 February 2019
[6] Aczon M, Ledbetter D, Ho L, Gunny A, Flynn A, Williams J, Wetzel R (2017) Dynamic mortality
     risk predictions in pediatric critical care using recurrent neural networks. Arxiv 1701.06675
[7] Arguello Casteleiro M, Maseda Fernandez D, Demetriou G, Read W, Fernandez-Prieto M, Des Diz
     J, Nenadic G, Keane J, Stevens R (2017) A case study on sepsis using pubmed and deep learning for
     ontology learning. Informat Health 235:516–520
[8] Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M (2017) Continuous state-space models
     for optimal sepsis treatment – a deep reinforcement learning approach. Arxiv 1705.08422
[9] Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA (2018) The artificial intelligence
     clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med 24:1716–1720
[10] Barsky A.B. Introduction to neural networks [Electronic resource]: textbook / Barsky AB - Electron.
     Text data.— Moscow, Saratov: Internet University of Information Technologies (INTUIT), IPR
     Media, 2020.— 357 pp. — Access mode: http://www.iprbookshop.ru/89426.html .— EBS
[11] Avati A, Jung K, Harman S, Downing L, Ng A, Shah NH (2017) Improving palliative care with deep
     learning. In: 2017 IEEE international conference on bioinformatics and biomedicine (BIBM), Kansas
     City, 2017, pp. 311–316.
[12] Beaulieu-Jones BK, Orzechowski P, Moore JH (2017) Mapping patient trajectories using
     longitudinal extraction and deep learning in the MIMIC-III critical care database. Biorxiv 5:177428
[13] Carneiro G, Oakden-Rayner L, Bradley AP, Nascimento J, Palmer L (2017) Automated 5-year
     mortality prediction using deep learning and radiomics features from chest computed tomography.
     In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI 2017), Melbourne, 2017,
     pp. 130–134.
[14] Miotto R, Wang F, Wang S, Jiang X, Dudley JT (2017) Deep learning for healthcare: review,
     opportunities and challenges. Brief Bioinform 19(6):1236–1246
[15] Mamoshina P, Vieira A, Putin E, Zhavoronkov A (2016) Applications of deep learning in
     biomedicine. Mol Pharm 13:1445–1454
[16] Chizhkov A.V. training of artificial neural networks computer science, computer technology and
     engineering education / A.V. Chizhkov - Rostov-on-Don: Publishing house of the Southern Federal
     University - 2010.
[17] Gurpreet Singh, Amandeep Kaur Comparative Analysis of K-Means and Kohonen SOM data mining
     algorithms based on student behaviors in sharing information on facebook. International Journal Of
     Engineering And Computer Science Volume 6 Issue 4 April 2017, Page No. 20990-20993
[18] S. Tcherezov, N.A. Tyukachev review of the main methods of classification and clustering of data /
     Voronezh State University 2009
[19] Parfenov, D.I., Bolodurina, I.P., Lapina, M.A. Development of a model for detecting security
     incidents in event flows from various components in a network of telecommunication service
     providers. IOP Conference Series: Materials Science and Engineering, 2020, 873(1), 012020

