Choice of a Deep Neural Network Architecture to Monitor the Dynamics of an Object State

Andrey Puchkov¹, Maksim Dli¹, Yekaterina Lobaneva¹, Maria Vasilkova¹

¹ National Research University «Moscow Power Engineering Institute» (Branch) in Smolensk, Energetichesky proyezd 1, g. Smolensk, 2014013, Russia
putchkov63@mail.ru, MiDli@mail.ru, lobaneva94@mail.ru, vasilkova_mariya00@mail.ru

Abstract. The study proposes a deep neural network architecture for monitoring the dynamics of the state of a complex technological object from data received in the form of images. The paper also gives recommendations for adapting the architecture to a specific application. The architecture is based on the cascaded use of convolutional neural networks to process multi-channel video information from different technological zones of one object.

Key words: machine learning, convolutional neural networks, computer vision

Introduction

Deep neural networks (DNN) now represent the most practically significant direction in the development of artificial intelligence methods. DNN are actively used in video analytics systems to obtain metadata from a video stream. From the technical point of view, video analytics is a software and hardware complex for the intelligent analysis of events that fall within the field of view of video surveillance systems and undergo deep processing by software tools. On average, only 10% of the data that a camera can supply for processing is used. Making this process intelligent increases the share of useful information that is actually exploited.

Initially, video analytics in industrial safety systems was applied to detect object movement or the crossing of a control line, to identify objects (people, transport, luggage) and to estimate their behavior [1]. However, the opportunities for using computer vision in manufacturing are not limited to this area of application.
Within the development program “Digital economy of the Russian Federation”, approved by the Russian government in 2017, the following promising directions for video information processing in automated systems for technological process control (ASTPC) with the use of artificial intelligence methods can be identified:
– re-identification of objects and processes as they pass from one production zone to another in accordance with the accepted technological and logistic chains;
– automated control of compliance with safety requirements;
– solution of transport problems with high-dimensional initial data, where linear programming methods lead to significant time expenditure;
– search for and prediction of equipment failures to reduce the probability of incidents;
– implementation of multidimensional machine vision based on processing information from a large number of sensors in order to monitor technological processes (TP) in real time and predict their behavior;
– business analytics based on metadata about the state of a production process generated by intelligent cameras, as well as the detection of hidden regularities in visualized information about the results of an enterprise's commercial activity.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: P. Sosnin, V. Maklaev, E. Sosnina (eds.): Proceedings of the IS-2019 Conference, Ulyanovsk, Russia, 24-27 September 2019, published at http://ceur-ws.org

Most of these directions involve video data processing and include estimating the rate of change of the state of the object under observation.
Changes in the image of the object under study can serve as a visual prompt in the decision-making process, because the meaning of many actions lies precisely in their dynamics: it is often enough to observe the movement of individual points in order to recognize the event. Here changes are understood as a wide set of image characteristics, i.e. the shifting of objects and contours against the background, changes in the brightness of elements, and texture modification [2].

The problem of developing and adapting machine vision methods and algorithms to assess the rate of change of the TP state, taking into account the specifics of production, is topical because of the significant diversity in the nature of the observed objects and processes.

1 Problem statement

Complex TP extend significantly not only in time but also in space. This makes it reasonable to include in the ASTPC not only signals from standard control and measuring equipment but also signals from additionally installed visual control systems for highly critical technological zones [3].

Suppose there are kz technological zones in which video cameras are installed. Each video camera gives a stream of shots with a resolution of n[ip]×m[ip] pixels, where ip = 1, 2, …, kz, at frequency f. As a result, the video data from all the cameras are represented by a tensor X of the sixth rank with the axes: camera, samples, shots, height, width, color [4].

Different camera models can produce video over a wide range of frame rates, usually from 10 to 60 fps (frames per second). For processing by the convolutional neural network, the discretization interval Δt(i) of the i-th information channel, i = 1, 2, …, I, where I is the number of video information channels, is chosen to be greater than one second.
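The layout of tensor X and the thinning of a camera stream to one shot per discretization interval can be illustrated with a minimal NumPy sketch. All sizes, the frame rate and the helper name `subsample_indices` are hypothetical and chosen only for illustration; the sketch also assumes that the shots of all cameras have already been brought to one common resolution, which the text does not require at this stage.

```python
import numpy as np

def subsample_indices(fps, dt_seconds, n_frames):
    """Indices of the shots kept when one shot is taken every dt_seconds."""
    step = int(round(fps * dt_seconds))
    if step < 1:
        raise ValueError("dt_seconds is shorter than one frame period")
    return np.arange(0, n_frames, step)

# Hypothetical sizes: cameras, samples, shots, height, width, color channels.
kz, samples, shots, height, width, color = 2, 3, 5, 48, 64, 3
X = np.zeros((kz, samples, shots, height, width, color), dtype=np.uint8)

# A 10-second recording at 30 fps (300 frames), thinned to one shot per 2 s.
kept = subsample_indices(fps=30, dt_seconds=2.0, n_frames=300)
```

With these assumed numbers, `kept` holds every 60th frame index, and each sample in X stores the five retained shots of one camera.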
This choice is based on the assumption that the inertia of the TP allows it without violating Kotelnikov's sampling theorem, which sets the maximum discretization interval at which the initial continuous signal can still be restored accurately. If the discretization interval needs to be less than one second, the video camera model has to be chosen more carefully, but the approach to dynamics recognition described below remains unchanged.

It is required to develop a deep neural network architecture that extracts metadata from the resulting tensor X; the metadata provide the recognition and forecast of the “movement” (evolution) of the TP in time. The recognition quality is estimated on the basis of the confusion matrix CM [5].

2 Background: methods for recognizing process dynamics from video data

DNN are used in the algorithmic support of video analytics systems for various application areas, for example: detection of technological defects [6], medical diagnostics [7], sensor information processing [8], vehicle identification [9], etc. However, in most cases these algorithms do not support the simultaneous description of an object by its image and its motion, which would allow events to be recognized even at low resolution and the evolution of the observed object state to be predicted. The only exception is the direction concerned with the recognition of human actions [10 – 11]. The difficulty of this direction lies in the need to take the environment into account, since the interpretation of the recognized action depends on it.

One of the popular methods, called “sliding window”, consists of several stages: selection of a fragment (a spatio-temporal parallelepiped); solution of the image classification problem; and search for objects in the three-dimensional spatio-temporal volume.
However, it does not suit the automated recognition of changes, as it requires the window borders to be specified on the image in advance, which makes it difficult to apply when the boundaries of this window need to be shifted in the observed technological process.

In other methods for analyzing video sequences the basic tool is the concept of optical flow. A widely used method for computing it was proposed by Bruce D. Lucas and Takeo Kanade in 1981. Optical flow is often defined as a vector field, or an image, of the apparent motion of objects, surfaces or scene edges resulting from the relative motion between an observer and a scene [12]. When the optical flow is analyzed, a displacement vector from the current shot to the next one is calculated for each pixel. As a result, matching takes place: for each pixel of one shot the same point is found on the other shot. The disadvantages of this approach include the need to solve the optical flow problem, which in turn requires handling the aperture problem [13]. Specialized software products have been created on the basis of these methods, in particular software for mapping and measuring the flow rate of particles in an arbitrary medium [14].

The proposed algorithm is based on a DNN ensemble whose structure includes a time delay unit that enables the dynamics of objects to be recognized from video data.

3 Proposed solutions

Advances in the use of deep neural networks in computer vision systems for various purposes give grounds for optimism regarding the recognition of TP dynamics. This approach reflects the modern paradigm of Software 2.0, which, unlike Software 1.0, does not imply writing an algorithm explicitly: instead, a neural network with a specified architecture is created, and it learns (adjusts itself) to solve a specific applied task.
A deep neural network models hierarchical abstractions in data using architectures that consist of cascaded ensembles of nonlinear transformations (filters). Several DNN architectures are popular today: the neocognitron, autoencoders, convolutional neural networks (CNN), the Boltzmann machine, deep belief networks, long short-term memory networks, gated recurrent networks, and residual neural networks [15]. This study uses convolutional neural networks.

The architecture of a neural network defines the hypothesis space, i.e. the number of classes into which the input data sets are split. The proposed deep neural network architecture for monitoring the object dynamics is shown in Fig. 1. It builds on successful practices of applying neural network ensembles [16 – 18] but differs from them in the presence of a time delay unit, whose signal is also fed to the output cascade of the network to enable the recognition of TP state changes.

After the discretization intervals Δt(i) have been calculated for all channels, the minimum Δt = min_i Δt(i) is chosen for further synchronization and unification of the video data transformation performed by the neural network.

Also for the sake of unification, the input multichannel images with a resolution of n[ip]×m[ip] pixels are normalized, before being fed to the neural network input, to one size of n×m pixels, which is smaller than the minimum of n[ip]×m[ip].

In the time delay block the moment of image acquisition is shifted by Δt, which makes it possible to calculate discrete analogues of derivatives when determining the changes of the values at the neural network outputs.

The input cascade of the deep neural network contains several CNN operating in parallel, each classifying the images from the cameras of the corresponding information channel taken at intervals Δt. This procedure consists in forming, at each moment Δt, the output channel vector V(j|i), j = 1, 2, …, cl(i), where cl(i) is the number of classes for the i-th information channel.
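The three preparation steps described above (choosing the common interval, normalizing the image size, and delaying the image by one interval) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and nearest-neighbour resampling is used as a stand-in because the text does not say which resizing method is applied.

```python
import numpy as np

def common_interval(dt_per_channel):
    """Common interval: the minimum of the per-channel intervals dt(i)."""
    return min(dt_per_channel)

def resize_nearest(frame, n, m):
    """Nearest-neighbour resize of an (H, W, C) frame to (n, m, C).

    Assumed stand-in for the unspecified normalization to n x m pixels.
    """
    h, w = frame.shape[:2]
    rows = np.arange(n) * h // n   # source row for every target row
    cols = np.arange(m) * w // m   # source column for every target column
    return frame[rows][:, cols]

def delayed_pairs(frames):
    """Pair each shot with the shot taken one interval earlier (time delay)."""
    return list(zip(frames[1:], frames[:-1]))

dt = common_interval([2.0, 1.5, 3.0])                 # hypothetical dt(i) values
small = resize_nearest(np.zeros((10, 20, 3)), 4, 5)   # hypothetical frame sizes
pairs = delayed_pairs([0, 1, 2])                      # stand-ins for three shots
```

Each pair holds the current and the delayed shot, which is the information needed later to form difference quotients of the network outputs.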
Fig. 1. Deep neural network architecture (blocks: process state video data, CNN input cascade, time delay, CNN internal cascade, CNN output cascade, dynamics process forecast)

The CNN of the input cascade form the elements of the vectors V(j|i), which take values in the range from 0 to 1. Each value reflects the degree of confidence of the i-th channel CNN that the parameter monitored through the image belongs to a particular class at the moment t(k) = kΔt, where k is the index of the time sample.

The duration of the technological process is divided into fragments of length ΔT. The fragments are assumed to be equal for all information channels and are defined by the required periodicity of the information flow to the ASTPC and by the characteristics of the most frequency-critical channel. By the moment T(ζ) = ζΔT, ζ = 1, 2, …, ψ, where ψ is the number of fragments of length ΔT, the matrix of classification results can be formed for each i-th channel:

\[
MV(i \mid \zeta) =
\begin{pmatrix}
V(1 \mid i, 1) & V(1 \mid i, 2) & \dots & V(1 \mid i, k) \\
V(2 \mid i, 1) & V(2 \mid i, 2) & \dots & V(2 \mid i, k) \\
\dots & \dots & \dots & \dots \\
V(cl(i) \mid i, 1) & V(cl(i) \mid i, 2) & \dots & V(cl(i) \mid i, k)
\end{pmatrix},
\tag{1}
\]

where the element V(j|i, k) is the j-th output of the input cascade CNN for the i-th information channel at the time sample kΔt. Matrix (1), in fact, reflects how the CNN confidence in the classification results changes while the TP passes through the interval ΔT with number ζ.

The CNN internal cascade receives the tensor composed of the individual matrices (1) for all information channels. At the output of the internal cascade the matrix MS is formed. It is an analogue of matrix (1) that contains a larger number of elements, determined by the number of information channels and the number of TP state classes for each channel.

To characterize the dynamics of the entire technological process, the tensor DV is calculated. Tensor DV contains the ratios of the increments of the MS matrix elements to the discretization interval Δt.
The rank of tensor DV is three, and its ζ-th cut has the form:

\[
DV(\zeta) =
\begin{pmatrix}
\dfrac{MS_{\zeta}(1,1) - MS_{\zeta-1}(1,1)}{\Delta t} & \dots & \dfrac{MS_{\zeta}(1,I) - MS_{\zeta-1}(1,I)}{\Delta t} \\
\dfrac{MS_{\zeta}(2,1) - MS_{\zeta-1}(2,1)}{\Delta t} & \dots & \dfrac{MS_{\zeta}(2,I) - MS_{\zeta-1}(2,I)}{\Delta t} \\
\dots & \dots & \dots
\end{pmatrix}.
\tag{2}
\]

Cut (2) can be interpreted as an image fed to the CNN output cascade. The elements of cut (2) are analogous to derivatives of continuous functions, as they reflect the change in the confidence of the neural network that the image belongs to a certain class. The rate of change can be used to forecast the TP development.

It should be noted that the number of classes can differ between information channels and between technological zones. Thus, to keep the dimensions consistent when tensor DV is fed to the neural network, some cuts contain zeros in the places of the redundant classes.

Tensor DV is formed from the data on the changes in the CNN confidence in the class membership of the information channel parameters, which are reflected in matrix (1). This suggests that a hypothesis space formed in this way can provide good forecasting results with the proposed deep neural network architecture.

4 Results

The recommended choice of the deep neural network architecture is to ensure that the number of convolutional neural networks in the input cascade corresponds to the number of channels of the video information flow. The authors present the results of a simulation experiment in which images of aluminum alloy ingots are processed in order to forecast the time of complete melting. The melting process takes approximately 300 seconds; the aggregate state is estimated from the image of the surface observed through a viewing window fitted in the furnace door [19].

The simulation program was written in Python 3.6 in the Spyder IDE from the Anaconda distribution (Linux version).
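The difference quotient in cut (2) and the zero padding of redundant classes can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the helper names `build_ms` and `dv_tensor` and all numerical values are hypothetical, and the MS matrix is assumed to be stored with classes along the rows and information channels along the columns, as in (2).

```python
import numpy as np

def build_ms(channel_vectors):
    """Assemble an MS matrix (classes x channels) from per-channel vectors.

    Channels with fewer classes are zero-padded at the redundant positions.
    """
    cl_max = max(len(v) for v in channel_vectors)
    ms = np.zeros((cl_max, len(channel_vectors)), dtype=float)
    for i, v in enumerate(channel_vectors):
        ms[: len(v), i] = v
    return ms

def dv_tensor(ms_sequence, dt):
    """Stack the cuts (MS_zeta - MS_{zeta-1}) / dt over consecutive fragments."""
    ms = np.stack(ms_sequence)        # shape: (fragments, classes, channels)
    return (ms[1:] - ms[:-1]) / dt    # shape: (fragments - 1, classes, channels)

# Hypothetical confidences: channel 1 has two classes, channel 2 has three.
ms_prev = build_ms([[0.9, 0.1], [0.2, 0.3, 0.5]])
ms_curr = build_ms([[0.7, 0.3], [0.1, 0.3, 0.6]])
dv = dv_tensor([ms_prev, ms_curr], dt=2.0)
```

Padded positions stay at zero in every cut, because both the current and the previous MS matrix hold zeros there.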
The CNN were created with the dedicated neural network library Keras, which is a high-level add-on to the TensorFlow computation framework [20]. TensorBoard is used to visualize the TensorFlow training process.

In the considered example only one information channel from the technological melting zone is used, so the structure of the applied network is significantly simplified. The network contains seven alternating convolution and subsampling layers and one fully connected output layer with four outputs, according to the number of defined classes for the aggregate state of the substance: «solid» (0 – 269 sec.), «initial transit» (270 – 279 sec.), «final transit» (280 – 289 sec.), «liquid» (290 – 300 sec.). The time interval is thus tied to the class; therefore, the classification also forecasts the time at which the aggregate state occurs.

Fig. 2a shows images of the melting zone taken at different moments; Fig. 2b shows the image of tensor DV reflecting the dynamics of the images presented in Fig. 2a. The sequence of images in Fig. 2b forms the dynamics trend of the melting process, which is recognized by the CNN.

Fig. 2. Processed images (panels a and b)

The initial training sample had 1600 examples for each class; the testing sample had 400 examples. In addition to the standard Dropout method used in CNN, augmentation was applied to reduce the risk of overfitting. Shifts, scale changes, rotation, mirror reflection and affine transformations were applied to the initial images of the ingot working surface. As a result of these procedures, the total size of the training sample was 32,000 examples and the total size of the testing sample was 8000 examples. The network was trained for 110 epochs. Categorical cross-entropy was set as the loss function for the CNN.

The video data were split into shots and converted into jpg format with the software utility Free Video to JPG Converter, with a sampling interval of one second.
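Because each class is tied to a time interval, the shots extracted at one-second intervals can be labelled directly from their timestamps. The following sketch shows one way to do this, using the class boundaries listed above; the function name `label_for_second` is hypothetical, and the paper does not describe how the labelling was actually performed.

```python
# (first second, last second, class label), as listed in the text.
CLASS_BOUNDS = [
    (0, 269, "solid"),
    (270, 279, "initial transit"),
    (280, 289, "final transit"),
    (290, 300, "liquid"),
]

def label_for_second(t):
    """Return the aggregate-state class for a shot taken at second t."""
    for start, end, label in CLASS_BOUNDS:
        if start <= t <= end:
            return label
    raise ValueError(f"second {t} is outside the 0-300 s melting process")

# Labels for every shot of one 300-second recording sampled once per second.
labels = [label_for_second(t) for t in range(0, 301)]
```

A single recording labelled this way is strongly imbalanced towards the «solid» class, which is one reason a fixed number of examples per class would need to be selected or augmented.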
Splitting the video into shots yields a sufficiently large training sample for the CNN, as a great number of images can be extracted even from the video recording of a single melting process. The training was conducted on a GeForce GTX 1060 video card installed in an Asus FX502VM notebook with an Intel Core i7-7700HQ CPU, which gives more than a twentyfold speed-up compared with training on an ordinary notebook processor.

The quality of CNN training is reflected in the accuracy metric, which is 77% on the testing samples. The plots of the loss function and the accuracy function are shown in Fig. 3.

The confusion matrix (CM), used to estimate the classification quality, has four rows and four columns in this case. Its columns correspond to the actual classes; its rows correspond to the classes returned by the classifier.

Fig. 3. Plots of the loss function and the accuracy function

When the matrix is filled, the number at the intersection of the row of the class returned by the classifier and the column of the class to which the object really belongs is increased by one. In this example every time interval corresponding to a melting stage is divided into 10 subintervals ΔT. After the experiments the matrix CM is filled as follows:

\[
CM =
\begin{pmatrix}
7 & 2 & 1 & 0 \\
1 & 6 & 2 & 1 \\
2 & 1 & 6 & 1 \\
0 & 2 & 1 & 7
\end{pmatrix}.
\]

The analysis of the CM matrix elements shows that the majority of cases are recognized correctly, as the diagonal elements of the matrix dominate (26 of the 40 entries lie on the diagonal).

5 Conclusion

The study of the possibility of applying deep convolutional neural networks to monitor the dynamics of the state of technological objects gave the following results:

1. An architecture of a deep neural network for obtaining information about the dynamics of the technological object state from the video data fed from different technological zones is proposed.
It is based on convolutional neural networks organized into input, internal and output cascades, which makes it possible to integrate different information flows into a single set and to raise the level of abstraction of the presented data. In the output cascade of the neural network a time delay in image processing is used to enable the recognition and forecast of the object state dynamics.

2. Recommendations are given on the choice and adaptation of the neural network architecture depending on the number of technological zones and the number of information channels.

3. The study presents the results of a simulation experiment which show the efficiency of the proposed neural network architecture. The results indicate that the architecture may be reasonable to use in various application areas where the dynamics of processes has to be controlled from the available video information.

6 Acknowledgment

The reported study was funded by RFBR according to the research project № 19-01-00425.

References

1. Rubio D. Videoanalytics: possibilities and solutions. (2016) Modern automation technologies, 4, pp. 86-92.
2. Li H., Qian X., Li W. (2017) Image Semantic Segmentation Based on Fully Convolutional Neural Network and CRF. In: Yuan H., Geng J., Bian F. (eds) Geo-Spatial Knowledge and Intelligence. GRMSE 2016. Communications in Computer and Information Science, vol 698. Springer, Singapore.
3. Pokhabov Y.P. Problems of dependability and possible solutions in the context of unique highly vital systems design. Dependability. 2019; 19(1): pp. 10 – 17. (In Russ) https://doi.org/10.21683/1729-2646-2019-19-1-10-17.
4. Chollet F. Deep learning with Python (2018) Peter, SPb.
5. Shunina Yu. S., Alekseeva V.A., Klyatchkin V.N. Performance criteria for classifiers (2015) UlGTU bulletin, 2(70) URL: https://cyberleninka.ru/article/n/kriterii-kachestva-raboty-klassifikatorov
6. Cha YJ., Choi W.
(2017) Vision-Based Concrete Crack Detection Using a Convolutional Neural Network. In: Caicedo J., Pakzad S. (eds) Dynamics of Civil Structures, Volume 2. Conference Proceedings of the Society for Experimental Mechanics Series. Springer, Cham
7. Kori A., Soni M., Pranjal B., Khened M., Alex V., Krishnamurthi G. (2019) Ensemble of Fully Convolutional Neural Network for Brain Tumor Segmentation from Magnetic Resonance Images. In: Crimi A., Bakas S., Kuijf H., Keyvan F., Reyes M., van Walsum T. (eds) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2018. Lecture Notes in Computer Science, vol 11384. Springer, Cham
8. Kasnesis P., Patrikakis C.Z., Venieris I.S. (2019) PerceptionNet: A Deep Convolutional Neural Network for Late Sensor Fusion. In: Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham
9. Xiang, L., et al.: Automatic vehicle identification in coating production line based on computer vision. In: International Conference on Computer Science and Engineering Technology, pp. 260 – 267. World Scientific Publication Co. Pvt. Ltd. (2016)
10. Ahlawat S., Batra V., Banerjee S., Saha J., Garg A.K. (2019) Hand Gesture Recognition Using Convolutional Neural Network. In: Bhattacharyya S., Hassanien A., Gupta D., Khanna A., Pan I. (eds) International Conference on Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol 56. Springer, Singapore.
11. Fan Y., Lam J.C.K., Li V.O.K. (2018) Multi-region Ensemble Convolutional Neural Network for Facial Expression Recognition. In: Kůrková V., Manolopoulos Y., Hammer B., Iliadis L., Maglogiannis I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol 11139. Springer, Cham
12. Solovich I.O., Belov Yu. S. Lucas-Kanade method application to calculate optical flow.
(2014) Engineering journal: science and innovations, 7 URL: http://engjournal.ru/catalog/pribor/optica/1275.html
13. Nagiev A.G., Sasyhkov V.V. The problem of aperture delay in digital measurement systems and its analytical solution using the matrix exponential method. (2017) Measuring engineering, 9, pp. 16 – 20.
14. Thielicke W. (2019). PIVlab - particle image velocimetry (PIV) tool (https://www.mathworks.com/matlabcentral/fileexchange/27659-pivlab-particle-image-velocimetry-piv-tool), MATLAB Central File Exchange. Retrieved April 24, 2019.
15. Sozykina A. V. Overview of deep neural network learning methods. (2017) YuUrGU bulletin: computational mathematics informatics, 6(3), pp. 28-59.
16. Frazão X., Alexandre L.A. (2014) Weighted Convolutional Neural Network Ensemble. In: Bayro-Corrochano E., Hancock E. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2014. Lecture Notes in Computer Science, vol 8827. Springer, Cham
17. Koitka S., Friedrich C.M. (2017) Optimized Convolutional Neural Network Ensembles for Medical Subfigure Classification. In: Jones G. et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham
18. Puchkov A., Dli M., Kireyenkova M. (2020) Fuzzy Classification on the Base of Convolutional Neural Networks. In: Hu Z., Petoukhov S., He M. (eds) Advances in Artificial Systems for Medicine and Education II. AIMEE2018 2018. Advances in Intelligent Systems and Computing, vol 902. Springer, Cham.
19. Shkundin S.Z., Kolistratov M.V., Belobokova Yu. Testing the performance of algorithms for determining changes in the aggregate state of a metal. (2018) System administrator, 10 (191), pp. 90-93.
20. Geron Au. Applied machine learning using Scikit-Learn and TensorFlow: concepts, tools and the technique for intellectual systems creation. (2018) Dialectic, Moscow.