
Multilabel Classification for Inflow Profile Monitoring

Dmitry I. Ignatov

dignatov@hse.ru

Pavel Spesivtsev

PSpesivtsev@slb.com

Dmitry Kurgansky

mykurgansky@mail.ru

Ivan Vrabie

1 2

vrabie

@mail.ru

Svyatoslav Elizarov

sorkerrer@gmail.com

Vladimir Zyuzin

0 2

VZyuzin@slb.com

0 Moscow Institute of Physics and Technology, Moscow, Russia
1 National Research University Higher School of Economics, Moscow, Russia
2 Schlumberger Moscow Research, Moscow, Russia


The purpose of this study is to identify the position of non-performing inflow zones (sources) in a wellbore by means of machine learning techniques. The training data are obtained using transient multiphase simulators and are represented as the following time series: bottomhole pressure, wellhead pressure, and flowrates of gas, oil, and water, along with a target vector of size N, where each element is a binary variable indicating the productivity of the respective inflow zone. The goal is to predict the target vector of active and non-active inflow sources given the surface parameters for an unseen well. A variety of machine learning techniques have been applied to solve this task, including feature extraction and generation, dimensionality reduction, ensembles and cascades of learning algorithms, and deep learning. The results of the study can be used to provide more efficient and accurate monitoring of gas and oil production and to support informed decision making.

Keywords: multi-phase flow, multilabel classification, time series, bottomhole pressure

1 Introduction

During the production phase of oil and gas wells, it often happens that oil does not enter through every inflow point, which decreases the efficiency of the operation and leads to undesired economic consequences. It is beneficial to determine which of the inflow points are inactive in order to properly design intervention operations. The main research hypothesis is as follows: using machine learning approaches, the active and non-active inflow points can be predicted from measurements of certain parameters at the wellhead, including pressure and total gas and oil productivity.

The paper is organized as follows. In Section 2 we formulate the studied problem as a multilabel classification problem. Sections 3 and 4 explain the data generation process and detail the performed data transformations, respectively. Section 5 describes the time-series-specific feature extraction process. Section 6 presents the obtained classification results along with feature importance estimation. Section 7 concludes the paper.

2 Problem Statement

The problem of inflow profile monitoring can be formulated as follows.

There are descriptions of objects $X \subseteq \mathbb{R}^d$, where $d$ is the size of the feature space, and a finite set of class labels $Y = \{0, 1\}^L$. A finite training set of observations is given as $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)} = (x_1, \ldots, x_d) \in X$ is the description vector of the $i$-th object (one measurement) and $y^{(i)} = (y_1, \ldots, y_L) \in Y$ is the label vector with

$$y_j = \begin{cases} 1, & \text{if there is an oil inflow at the } j\text{-th position}, \\ 0, & \text{otherwise.} \end{cases}$$

However, in our case, the description vector $x^{(i)}$ can be recast as containing the time series of $d$ sensors within a certain time interval $T = \{1, 2, \ldots, t\}$:

$$x^{(i)} = \big( (x_1, \ldots, x_t)_1, \ldots, (x_1, \ldots, x_t)_d \big) \in \mathbb{R}^{d \times t}.$$

Using a training set $s_t = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, it is necessary to construct a mapping function (classifier)

$$h : X \to Y.$$

For each test instance $\tilde{x} \in X$, we get a prediction $\hat{y} = h(\tilde{x})$.

Thus, we need to solve a multilabel classification problem, in which an object can belong to several classes at the same time, i.e. the classes are not mutually exclusive. This type of problem arises, for example, in text mining (automatic tag assignment, text categorization and classification) and, similarly, in the categorization of images. Multilabel classification is an extension of the traditional classification problem with several classes, i.e. the multi-class problem. Approaches to solving this problem are mentioned in Section 6 and can be partially found in [10, 7].

3 Data Generation

The training data are obtained from numerical simulations that describe the physical processes taking place in wells [9]. For the given input parameters (wellbore geometry, initial distribution of volume fractions of phases, pressure at the wellhead, choke size, etc.), the simulator models the behavior of the wellbore over a given time interval $T$ and generates the following time series:

- $BHP(t)$ is the bottomhole pressure (measured at the source closest to the surface);
- $WHP(t)$ is the wellhead pressure;
- $Q_o(t)$ is the surface oil flowrate;
- $Q_w(t)$ is the surface water flowrate;
- $Q_g(t)$ is the surface gas flowrate.

The target vector $y$ of length 20 is generated randomly and consists of ones and zeros that characterize the presence or absence of inflow at each of the 20 predefined inflow points along the wellbore. In the present work, 5000 simulation realizations are used.

4 Data preparation

Given that each time series is long and has a complex structure, which may carry latent complex patterns, it is necessary to transform it into a smaller space of features that are more informative than the raw values of the series at certain timesteps t. For example, one can extract the minimal and maximal values, the number of local maxima and minima ("peaks"), the average and median values, etc. In addition, many machine learning algorithms are sensitive to data scaling, for example the nearest neighbor method and the Support Vector Machine. In this study, we use two common types of data normalization: normalization by standard deviation and Min-Max normalization. Another important task is to reduce the dimension of the feature space, and we examine the most popular methods for doing so:

1. Principal Component Analysis (PCA)
2. Independent Component Analysis (ICA)
3. Truncated Singular Value Decomposition (TSVD).

Hence, the original task is divided into two subtasks:

1. Determination of the appropriate feature space $X'$.
2. The choice/tuning of the optimal classifier $h$.
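The two normalizations and the PCA projection mentioned above can be illustrated with a minimal NumPy sketch. The function names, the random stand-in feature matrix, and the choice of 3 components are assumptions made for illustration and are not taken from the study; PCA is computed here through an SVD of the centered data.

```python
import numpy as np

def zscore(X):
    """Normalization by standard deviation: zero mean, unit variance per column."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std

def minmax(X):
    """Min-Max normalization: rescale every column to the range [0, 1]."""
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.min(axis=0)) / span

def pca(X, n_components):
    """Project X onto its first principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical stand-in for a feature matrix: 100 objects, 12 features
# with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12)) * rng.uniform(1, 50, size=12)

Z = pca(zscore(X), n_components=3)  # reduced feature space
M = minmax(X)
```

In a train/test setting, the column statistics and the projection matrix would normally be estimated on the training part only and then reused for the test part.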

The average 0/1-loss on a test sample of size M is used as the quality criterion. To characterize the average prediction accuracy for each inflow point, one can consider the whole vector of 0/1-losses over all inflow points. In our experiments, the averaged accuracy for an inflow point varies with its position, showing higher values for the first several positions (those closer to the surface, see Fig. 1).
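As an illustration of this criterion, the following sketch computes the 0/1-loss separately for every inflow point and then averages it. It assumes that the loss is counted label-wise (one error per mismatched component of the target vector); the toy arrays are invented for the example.

```python
import numpy as np

def zero_one_loss_per_label(y_true, y_pred):
    """Fraction of test objects misclassified, computed for each label separately."""
    return (y_true != y_pred).mean(axis=0)

# Toy test sample: M = 4 objects, 3 inflow points (made-up values).
y_true = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 0]])

per_label = zero_one_loss_per_label(y_true, y_pred)  # one value per inflow point
average = per_label.mean()                           # averaged over all inflow points
```

The per-label accuracy discussed in the text is simply one minus the corresponding entry of `per_label`.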

5 Feature extraction from time series

The set of predictors for training in the initial sample is represented by time series, from which it is possible to extract a set of additional parameters that can positively affect the quality of the algorithms [3].

The Fourier transform is one of the basic tools in signal analysis. This transform allows one to move from the time domain to the frequency domain, that is, to get rid of signal shifts in time. The Discrete Fourier Transform (DFT) is used for discrete signals.
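The shift-invariance argument can be checked numerically. The sketch below, which uses a synthetic two-tone signal and an arbitrary shift of 37 samples (both are assumptions for illustration), takes DFT magnitudes as features; for a circular shift the magnitudes are exactly preserved, while only the phases change.

```python
import numpy as np

def dft_magnitudes(x, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients of a real signal."""
    return np.abs(np.fft.rfft(x))[:n_coeffs]

t = np.arange(256)
# Synthetic signal: 5 and 20 cycles per 256 samples.
x = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 20 * t / 256)
shifted = np.roll(x, 37)  # the same signal, circularly shifted in time

features = dft_magnitudes(x, 32)
features_shifted = dft_magnitudes(shifted, 32)
```

For non-circular shifts of a finite recording the invariance holds only approximately.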

An alternative to the Fourier transform is the wavelet transform, which is a convolution of the wavelet function with the signal. The wavelet transform translates the signal from its time representation to its time-frequency representation. For discrete signals, a discrete wavelet transform is applied by means of a set of filters. First, the signal is passed through a low-frequency filter (LF-filter) with impulse response $g$:

$$\hat{s}[n] = \sum_{k=-\infty}^{+\infty} s[k]\, g[n-k].$$

At the same time, the signal is similarly decomposed using a high-frequency filter f (HF-filter). The result contains detail coefficients (after the HF-filter) and approximation coefficients (after the LF-filter). After the filtering, the filter outputs are downsampled by a factor of 2.
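One level of this decomposition can be sketched as follows. The Haar filter pair is an assumption chosen for brevity (the text does not state which wavelet was used), and the input signal is a made-up example of even length.

```python
import numpy as np

def haar_dwt_level(s):
    """One level of the discrete wavelet transform with Haar filters."""
    g = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-frequency (LF) filter
    f = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-frequency (HF) filter
    # Convolve with each filter, then keep every second sample (downsampling by 2).
    approx = np.convolve(s, g)[1::2]
    detail = np.convolve(s, f)[1::2]
    return approx, detail

s = np.array([4.0, 4.0, 2.0, 6.0, 1.0, 1.0, 5.0, 3.0])
approx, detail = haar_dwt_level(s)
```

The approximation coefficients can be fed into the same function again to obtain the next, coarser level of the decomposition.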

Different output values of a linear regression were also used as features. In our case, we used a sample from a time series as the predictor, and a discrete sequence from 0 to the length of the sample minus 1 as the target variable.

Another attribute is the sum of squared values (energy) of the time series, which is given below:

$$E = \sum_{i=1,\ldots,n} x_i^2.$$

The average absolute change was also taken into account, which is simply the following:

$$\frac{1}{n} \sum_{i=1,\ldots,n-1} |x_{i+1} - x_i|.$$
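Both quantities are straightforward to compute; a minimal sketch is given below. The function names and the toy series are illustrative assumptions, and the normalization by n follows the formula as written in the text (some implementations divide by n - 1 instead).

```python
import numpy as np

def energy(x):
    """Sum of squared values of the series."""
    return float(np.sum(x ** 2))

def mean_abs_change(x):
    """Sum of absolute differences between consecutive values, divided by n."""
    return float(np.sum(np.abs(np.diff(x))) / len(x))

x = np.array([1.0, 3.0, 2.0, 5.0])
e = energy(x)            # 1 + 9 + 4 + 25
c = mean_abs_change(x)   # (2 + 1 + 3) / 4
```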

Many more parameters can be used to enlarge the feature space, among them the average, standard deviation, median, dispersion, min/max value, trend, number of min/max values, lower/upper quartile, and last position of the min/max value.

All the features mentioned in this section can be calculated by specialized Python libraries. Here we used the tsfresh library [2] and produced more than 1200 features. The full list of features available for extraction can be found at https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html

6 Experiments

To conduct experiments with the data obtained by the simulators, a set of 5000 numerical simulations was generated; each simulation contains the readings of 4 different sensors that produce measurements for 3600 seconds at a sampling rate of 1 Hz. The average 0/1-loss for each inflow point (or averaged over all of them) on the test sample is used as the quality criterion. The split into training and test sets was made by randomly sampling the generated observations in the ratio 4:1.

The first experiment tested the approach of independent classifiers, trained separately for each of the 20 sources. In addition to selecting the optimal classifier, it is necessary to determine the appropriate feature space $X'$. For this purpose, many different methods of dimensionality reduction and normalization were tested, both on the initial data and on the features extracted from the time series. Every dimensionality reduction method was tested with the following set of classification algorithms: Random Forest (RF), SVM, kNN, and XGBoost [1]. The mean 0/1-loss varies from 0.36 to 0.39.

The best algorithm was XGBoost with PCA applied to z-score-normalized features obtained from the time series. The same combination of dimensionality reduction method and algorithm, but with min-max normalization, gave the third best performance. The second experiment built an ensemble of the 10 best performing algorithms and determined each label by majority voting. As expected, the results were slightly better: the average value of the 0/1-loss was 0.31.
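The voting step of the second experiment can be sketched as follows, assuming that every ensemble member outputs a full binary label matrix. The tie-breaking rule (a tie yields label 0, which matters for an even number of voters such as 10) is an assumption, since the text does not specify it; the three toy models are invented for the example.

```python
import numpy as np

def majority_vote(predictions):
    """Combine binary predictions of shape (n_models, n_samples, n_labels).

    A label is set to 1 only if strictly more than half of the models vote for it,
    so ties are resolved in favor of 0 (an assumption of this sketch).
    """
    votes = predictions.sum(axis=0)
    return (2 * votes > predictions.shape[0]).astype(int)

# Toy ensemble: 3 models, 2 samples, 2 labels.
predictions = np.array([[[1, 0], [1, 1]],
                        [[1, 1], [0, 1]],
                        [[0, 0], [0, 1]]])
combined = majority_vote(predictions)
```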

The third experiment was aimed at testing the approach of classifier chains [7]. By a chain of classifiers, Read et al. [7] mean a simple classifier cascade: after the first component of the target vector is predicted, the second component is predicted on the same set of features plus the prediction for the first component (or its known value for training data) as an extra feature, and similarly for the remaining components in sequence. To assess whether this approach is promising, a correlation matrix was built between the values of all sources. The resulting matrix contained no correlation greater than 0.1, so building classifier chains would not bring a significant improvement in quality.
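The correlation check can be reproduced on synthetic labels. The sketch below assumes that every component of the target vector is drawn independently with probability 0.5, which is only one possible reading of "generated randomly"; the actual label distribution of the study is not given.

```python
import numpy as np

def max_offdiag_correlation(Y):
    """Largest absolute pairwise correlation between different label columns."""
    corr = np.corrcoef(Y, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diagonal)))

# Synthetic stand-in for the label matrix: 5000 simulations, 20 inflow points.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(5000, 20))
strongest = max_offdiag_correlation(Y)
```

When the labels are nearly uncorrelated, the prediction of one component carries almost no information about the others, which is why feeding it to the next classifier in a chain is not expected to help.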

The fourth experiment was originally intended to predict the number of active inflow zones. For each sample in the available training data, the number of active inflow zones was counted, and a multiclass classification task was formulated. The prediction accuracy was 1. Given such a good result, we proposed a version of the cascade classifier that works according to the following scheme:

1. Predict the number of working sources.
2. Obtain the probability of class 1 for each source separately.
3. Sort the probabilities in descending order.
4. Get the number of sources equal to one at different probability thresholds (calibration step).
5. If the number of sources labeled "1" (i.e., active sources) at a given probability threshold is greater than the predicted number of sources, then assign the label "0" to the sources with the lowest probabilities until the number of active inflow points becomes equal to the predicted number.

However, this algorithm not only failed to reduce the 0/1-loss below that of the ensemble, but significantly increased it to 0.44. This can be explained by the fact that the current scheme of the cascade algorithm does not handle the case in which the number of sources labeled as active is smaller than the predicted number. In addition, a significant part of the probabilities of belonging to class 1 are very similar across sources, which makes it impossible to exclude only the wrong values.
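The correction step of the cascade can be sketched for a single simulation as follows. The function name, the default threshold of 0.5, and the toy probabilities are assumptions for illustration; in line with the description, the sketch only removes excess active labels and leaves the prediction unchanged when too few sources are labeled active.

```python
import numpy as np

def cascade_predict(proba, n_active, threshold=0.5):
    """Threshold per-source probabilities, then trim labels to the predicted count.

    proba    : probabilities of class 1 for each source
    n_active : predicted number of working sources
    """
    labels = (proba >= threshold).astype(int)
    excess = int(labels.sum()) - n_active
    if excess > 0:
        active = np.flatnonzero(labels)
        # Active sources with the lowest probabilities are switched off first.
        weakest = active[np.argsort(proba[active])[:excess]]
        labels[weakest] = 0
    return labels

proba = np.array([0.9, 0.55, 0.2, 0.7, 0.6])
trimmed = cascade_predict(proba, n_active=2)    # four candidates, two are kept
untouched = cascade_predict(proba, n_active=5)  # too few candidates: no change
```

The second call shows the unhandled case named in the text: when fewer sources pass the threshold than predicted, nothing is added.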

The fifth experiment was designed to use both the initial data and the features extracted from them. XGBoost was chosen as the algorithm, and the feature space consisted of the following sets of features:

- 300 ICA components computed on the transposed time series of the training set normalized by z-score;
- 300 PCA components computed on the more than 1200 features extracted from the time series;
- the number of working sources for the simulation (can be predicted by a simple binary classifier, e.g., logistic regression).

This method reduced the loss to 0.26, which is the best result in this study.

For the sake of comparison, a series of experiments with deep neural networks was conducted in Keras over TensorFlow. We used both LSTM [4] and CNN [6] networks, as well as their mixture, on all 5000 examples given as normalized and concatenated time series, in a 4500/500 split for training and validation. The highest validation accuracy was about 0.59.

7 Conclusion

We considered and tested methods of extracting significant features from multivariate time series, as well as methods of data normalization and dimensionality reduction. Several basic algorithms and their ensembles were tested, and a cascade of two classification algorithms was proposed and applied.

The best result, 0.26 in terms of average 0/1-loss, was shown by the XGBoost method with specially constructed sets of features.

The results of our experiments are summarized in Table 1.

Our analysis demonstrates that inflow profile monitoring using surface measurements is a challenging problem. However, a combination of machine learning techniques makes it possible to obtain results significantly better than a random guess. We hope that further enhancement of specially designed methods based on classifier ensembles, relevant deep neural network architectures, and time-series feature extraction techniques may further improve the quality of multi-label prediction in the studied problem.

Acknowledgments. The work of Dmitry Ignatov (Sections 2, 6, and 7) was supported by the Russian Science Foundation under grant 17-11-01294 and performed at National Research University Higher School of Economics, Russia.

References

1. Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 785–794, 2016.
2. Maximilian Christ, Nils Braun, Julius Neuffer, and Andreas W. Kempa-Liehr. Time series feature extraction on basis of scalable hypothesis tests (tsfresh – a Python package). Neurocomputing, 307:72–77, 2018.
3. Marco Fagiani, Stefano Squartini, Leonardo Gabrielli, Marco Severini, and Francesco Piazza. A statistical framework for automatic leakage detection in smart water and gas grids. Energies, 9(9), 2016.
4. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.
5. Dmitry I. Ignatov, Konstantin Sinkov, Pavel Spesivtsev, Ivan Vrabie, and Vladimir Zyuzin. Tree-based ensembles for predicting the bottomhole pressure of oil and gas well flows. In Wil M. P. van der Aalst et al., editor, Analysis of Images, Social Networks and Texts - 7th International Conference, AIST 2018, Moscow, Russia, July 5-7, 2018, Revised Selected Papers, volume 11179 of Lecture Notes in Computer Science, pages 221–233. Springer, 2018.
6. Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255–258. MIT Press, Cambridge, MA, USA, 1998.
7. Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
8. Pavel Spesivtsev, Konstantin Sinkov, Ivan Sofronov, Anna Zimina, Alexey Umnov, Ramil Yarullin, and Dmitry Vetrov. Predictive model for bottomhole pressure based on machine learning. Journal of Petroleum Science and Engineering, 166:825–841, 2018.
9. Pavel E. Spesivtsev, Andrey D. Kharlashkin, and Konstantin F. Sinkov. Study of the transient terrain-induced and severe slugging problems by use of the drift-flux model. SPE Journal, 22(SPE-186105-PA), 2017.
10. Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. IJDWM, 3(3):1–13, 2007.