
Workshop on Complex Data Challenges in Earth Observation, July 25, 2022

Multimodal Crop Type Classification Fusing Multi-Spectral Satellite Time Series with Farmers Crop Rotations and Local Crop Distribution

Valentin Barriere

Martin Claverie

European Commission's Joint Research Center, Via Fermi 2749, 21027 Ispra (VA), Italy

Accurate, detailed, and timely crop type mapping provides valuable information for institutions, helping them design policies that better match the needs of citizens. In the last decade, the amount of available data has increased dramatically, whether it comes from Remote Sensing (using Copernicus Sentinel-2 data) or directly from the farmers (who provide in-situ crop information over the years, and hence information on crop rotations). Nevertheless, the majority of studies are restricted to a single modality (Remote Sensing data or crop rotation) and never fuse the Earth Observation data with domain knowledge such as crop rotations. Moreover, when they use Earth Observation data, they are mainly restricted to one year of data and do not take past years into account. In this context, we propose to tackle a land use and crop type classification task using three data types, with a Hierarchical Deep Learning algorithm that models the crop rotations like a language model and the satellite signals like a speech signal, and uses the crop distribution as an additional context vector. We obtained very promising results compared to classical approaches, increasing the Accuracy by 5.1 points in a 28-class setting (.948), and the micro-F1 by 9.6 points in a 10-class setting (.887) using only a set of crops of interest selected by an expert. Finally, we propose a data-augmentation technique that allows the model to classify the crop before the end of the season, which works surprisingly well in a multimodal setting.

Keywords: Remote Sensing, Farmer's Rotations, Multimodal System, Hierarchical Model

1. Introduction

Timely and accurate crop type mapping provides valuable information for crop monitoring and production forecasting [1]. In-season crop type mapping can serve not only to better estimate the crop areas, but also to improve yield forecasting by using crop-type specific models. Crop type mapping is thus a major input for crop monitoring systems that focus on in-season forecasts of crop production.

The high-spatial resolution time series make it possible to determine crop type at a sub-parcel level in most agricultural areas. Most remote sensing classification systems rely on supervised techniques, which require an in-situ crop identification survey. If the survey data are provided within the season, some systems [2] are designed to predict crop type along the season with a given uncertainty, even if the crop cycle is on-going. Such survey data are expensive, because labels from the current year are needed to train a model, difficult to collect at large scale, and in most cases delivered after the cropping season. There is a high demand for crop type mapping that does not rely on survey data from the on-going season. Such approaches, like the one proposed in this study, are based on a model trained with past seasons and applied to the current one. In addition, we propose a data-augmentation method to obtain satisfying results earlier in the season.

Earth Observation-based crop type mapping. Machine learning classification methods have been widely tested to derive crop type maps from remote sensing data. Among the various methods, the Random Forest algorithm has proved its capacity to accurately identify crop types, accounting for large and non-parametric data sets [3]. Since 2015 and the launch of the first satellite of the Copernicus Sentinel-2 (S2) constellation, the perspective for crop type mapping at large scale has changed. The high spatial and temporal resolution of S2 indeed offers an appropriate data set to distinguish crop types, based on the spectral and temporal signals, at parcel or sub-parcel level in most agricultural regions. Taking advantage of this capacity, some operational systems have been expanded [4, 2, 5], combining Earth Observation (EO) data, in-situ observations and classification algorithms to deliver crop type maps at regional, country or continental scale [6].

Crop type mapping using Deep-Learning method. Recent progress in deep learning benefits crop type mapping applications. In [7], the authors classify crop types at the parcel level, using data from Brittany (France) during the 2017 season. They compared a Transformer-Encoder [8] and a Recurrent Neural Network of the Long Short-Term Memory (LSTM) type [9]. They obtain comparable results with the Transformer and the LSTM, with the best accuracy (0.69) for the former and the best macro-F1 (0.59) for the latter.

In [10], the authors designed a crop classifier at the parcel level using S2 and compared several approaches to model the signal, including a Transformer and an LSTM. They obtain overall accuracies between 0.85 and 0.92 using the LSTM, depending on the number of classes considered. A similar approach was applied by [11] on 40k parcels in Central Europe using S2. They proposed a new early classification mechanism that enhances a classical model with an additional stopping probability based on the previously seen information.

Finally, [12] use the technique developed in [13], where they tackle crop classification at the pixel level, i.e. accounting for the spatial variation to detect parcel boundaries. They use a CNN-LSTM network on S2 images to classify 17 types of crops.

Modeling the crop rotation sequences. Crop rotation is a widely used agronomic technique for sustainable farming, preserving the long-term soil quality. Good understanding and design of crop rotations are essential for sustainability and to mitigate the variability of agricultural productivity induced by climate change. The crop rotation depends on the farmer's management decisions, but some good practices are shared, which makes it possible to model crop rotation patterns [14]. They nonetheless remain complex and unstable in time; changes may be related to, e.g., economic considerations (commodity prices) or administrative regulation (e.g. changes in subsidies). Models based on expert knowledge are thus very limited and rarely accurate over large areas and long periods. Alternatively, estimation of the crop sequence probabilities without a priori knowledge, using survey data and hidden Markov models, has been demonstrated in France [?]. However, survey data are not always available. Relying on machine learning techniques, [15] use a Markov Logic model to predict the following year's crop in France, with an accuracy of 60%. In [16], the authors used deep neural networks to reach a maximum accuracy of 88% on a 6-class portion of the US Cropland Data Layer (CDL) dataset over 12 years [17].

Motivation. Many works focus on the use of remote sensing to predict the crop type at pixel or parcel level using only the EO and in-situ observations of the current year. They therefore consider the signal to be independent from one year to another. Other works use the crop rotations of the parcels to tackle a pre-season prediction of the crop type, focusing on problems with few classes. In this case, too much information is clearly missing to reach high performances. As of 2022, we identified a single study combining the use of crop rotations and satellite time-series data over several years: [18]. They present a methodology to derive a near real time Cropland Data Layer over major US agricultural states. The methodology is nonetheless restricted to a limited number of crop types and to the use of a Random Forest classifier, while recent progress in deep learning shows tremendous improvements in such data mining problems.

Contributions. We propose to model both the crop rotations and the S2 time series signal in a multimodal way using a hierarchical Long Short-Term Memory (LSTM) network. The contribution is unique in terms of design, as no work has been proposed that fuses the large amount of temporally fine-grained EO data with crop rotation analysis in an advanced deep learning method. The crop rotations and the S2 time series are enhanced by the crop distributions of the neighboring fields, taken from the previous year. The crop rotations are modeled over the years as words would be in a language model [19], helped by the S2 time-series data, which are modeled as if they were the prosody of the speaker. The high-level features we add on the last layer of the network can be seen as the distribution of the words used by our speaker. Finally, we also propose a data-augmentation technique for the in-season classification, which randomly crops the end of the RS time-series data. It makes it possible to learn a model able to classify the type of crop without the whole time series, hence before the end of the season.

2. Methodology

2.1. Dataset

The study focuses on data acquired over the Netherlands and covers the period 2009-2020 for the crop type labeling and the parcel identification, and the period 2016-2020 for the S2 data.

Crop Type data. The crop type data were obtained from the Dutch Land Parcel Identification System and GeoSpatial Aid Application, named Basisregistratie Percelen (BRP). Dutch farmers must annually record their field parcel boundaries and associated cultivated crops.¹ The 12 yearly BRP datasets (2009-2020) were merged through geographical polygon intersections. The output polygons correspond to the 12-year intersected areas, and they are associated with 386 crop codes. Polygons with an area lower than half a hectare were discarded. The output product contains 974,000 polygons covering a total of 1,600 Mha.

For the evaluation, we propose 3 granularities of labels, using several aggregations led by a domain expert and yielding 386, 28 and 12 crop classes.

¹ https://data.overheid.nl/data/dataset/basisregistratie-gewaspercelen-brp

Data. The study relies on the analysis of optical Copernicus Sentinel-2 (S2) data. The S2 constellation provides observations with a minimum revisit time of five days over ten land spectral bands of the optical domain (460-2280 nm), with a spatial resolution of 10-20 meters depending on the band. The data are processed up to surface reflectance (SR) Level 2A, accounting for atmospheric corrections and cloud/cloud-shadow screening, using the sen2cor algorithm [20]. The data are available through the JEODPP platform [21]. Cloud-free SR data were processed to 20-m Leaf Area Index (LAI) and Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) using BV-NET [22] and the calibration settings of [23]. For each polygon, B4 (red band) SR, B8A (near infrared band) SR, LAI and FAPAR were averaged at polygon level using pixels in a 20-m inner buffer in order to remove parcel edge effects.

Time series Smoothing. Despite the cloud and cloud-shadow screening of L2A S2 products, noise remains in the resulting time series [24]. We applied a time series outlier detection based on B4 (for omitted clouds) and B8A (for omitted cloud shadows), using the Hampel filter [25]. Data flagged by the filter were removed for the four variables. The filtered time series of the four variables were then smoothed using the Whittaker algorithm [26] as implemented by the World Food Program.² Time series were first resampled and interpolated to a 2-day time step, and then the Whittaker algorithm was applied, using the V-curve optimization of the smoothing parameter. This yielded 2-day smoothed time series of each of the four variables, from October N-1 to October N for the cropping season of year N.

² https://github.com/WFP-VAM/vam.whittaker

2.2. Feature Extractions

Crop Types. The crop type labels contain 386 different types of crops over the 12 years of study. We model the crop as a one-hot vector of size 386 and use it as input to an embedding layer.

We integrate the EO time series spatially by averaging at the parcel level, then temporally using a sliding window of 30 days with a step size of 15 days. For each parcel, this yields 25 windows over the whole year for each Remote Sensing (RS) signal, which we integrate temporally using 7 statistical functionals: mean, standard deviation, 1st quartile, median, 3rd quartile, minimum and maximum. In total we obtain 7*4=28 features per window, leading to 700 features per year. With this configuration the windows overlap, which avoids losing information by breaking the signal dynamics, at the price of some redundancy in the features. On each window, we integrate each signal using statistical functionals, as would be done for speech data [27].

Spatial Crop Distribution. The spatial crop distribution was derived for the year 2019 (year N-1 as compared to the 2020 validation test set). For each polygon, we compute the sum of the surface of each crop of the database included in a 10-km circle and turn it into a percentage. This a-priori distribution of crops has been shown to be relatively stable in time, with minor changes from year to year [28]. We round the probabilities at $10^{-4}$, so that some non-zero values become 0.

2.3. Learning Model

This section describes the learning model and the integration of the features as observations.

Unimodal RNN-LSTM Crop Rotations model. We model the crop rotation at the level of a year using an LSTM that is trained like a language model. Indeed, each crop can be seen as a token in a sentence, and a recurrent neural network can be trained to predict the next word given the preceding words. We first add an embedding layer to transform the crop type $c_t$ at time $t$ into a vector (see Equation 1).

$$\mathrm{emb}_t = \mathrm{Emb}(c_t) \qquad (1)$$

Then we feed this vector into the RNN to produce a hidden state $h_t$ at time $t$ (see Equation 2), which is used to predict the next crop $c_{t+1}$ (see Equation 3).

$$h_t = \mathrm{RNN}(\mathrm{emb}_t \mid h_{t-1}) \qquad (2)$$

$$P(c_{t+1} \mid c_t, \ldots, c_1) = \mathrm{FC}(h_t) \qquad (3)$$

Features from RS signal. Using only the past rotations to predict the following year's crop is very difficult; hence we chose to add available information from satellite data in order to make the model more robust. Firstly, we enhance the unimodal LSTM crop model by adding information from RS, aligned at the year level, concatenating the unimodal RS vector with the crop embedding. Secondly, we process the RS signal beforehand using another RNN and concatenate the unimodal RS vector thus obtained with the crop embedding, in a hierarchical way. These networks are denoted with a Hier- prefix in their name.

Multimodal model with RS. For the first model, we integrate the RS features at the year level before the LSTM modeling the crop types. We feed the 700 features $\mathrm{RS}_t$ into a neural network layer to reduce their size and then concatenate them with the crop embeddings before the LSTM (see Equation 4), using $\mathrm{emb}'_t$ instead of $\mathrm{emb}_t$ in Equation 2. This model is denoted as LSTM.

$$\mathrm{emb}'_t = [\mathrm{emb}_t, \mathrm{FC}_{RS}(\mathrm{RS}_t)] \qquad (4)$$

Bidirectional RNN-LSTM with attention to model the RS time-series. The first model presented above does not take into account the sequentiality of the RS signal. We correct this aspect by processing the RS features at the year level with a first RNN, before feeding their yearly representation into the second neural network modeling the crop types, leading to a hierarchical network [29]. This gives 28 features per window $\mathrm{RS}_{t,i}$, for a sequence length of 25 per year.

We chose to replace a simple LSTM with a bidirectional LSTM (biLSTM) with a self-attention mechanism [30], following the assumption that some parts of the year are more important than others to discriminate the crop type. This model is denoted as HierbiLSTM.

The biLSTM is composed of 2 LSTMs: one reads the sequence forward and the other reads it backward. The final hidden states are a concatenation of the forward and backward hidden states. For a sequence of inputs $[\mathrm{RS}_{t,1}, \ldots, \mathrm{RS}_{t,n}]$ it outputs hidden states $[h^{RS}_{t,1}, \ldots, h^{RS}_{t,n}]$. The attention layer computes the scalar weights $\alpha_i$ for each of the $h^{RS}_{t,i}$ (see Equation 5) in order to aggregate them into the final state $h^{RS}_t$ (see Equation 6).

$$\alpha_i = \mathrm{Att}(h^{RS}_{t,i}) \qquad (5)$$

$$h^{RS}_t = \sum_i \alpha_i \, h^{RS}_{t,i} \qquad (6)$$

Locally aggregated crop distributions. When classifying at the scale of a whole country, agricultural practices, such as the types of crops that are grown, can change from one region to another. Typically, the distribution of the crop types in a region is stable over the years and represents the kind of crops expected to be found in this part of the world. We integrated this local information by adding a vector representing the distribution over the crop types in an area corresponding to a 10-km circle centered on the studied parcel.

We chose to add the distribution vector before the last layer because it is a high-level feature with regard to the task we are tackling, and the deeper one goes into the layers, the higher-level the representations are w.r.t. the task [31]. We concatenated the hidden state $h_t$ of the LSTM with the crop distribution vector $d$ and mixed them using two fully connected layers $\mathrm{FC}_1$ and $\mathrm{FC}_2$ (see Equation 7).

$$h'_t = \mathrm{FC}_2(\mathrm{FC}_1([h_t, d])) \qquad (7)$$

Hence, we obtain $h'_t$ instead of $h_t$ before the final fully connected layer from Equation 3. This final model is denoted as Final.
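The crop distribution vector that is concatenated with the LSTM state can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `local_crop_distribution`, the tuple layout of `neighbours`, the use of centroid distances and the reading of "10-km circle" as a 10-km radius are all assumptions made for illustration. Only the area-share computation and the rounding at four decimals come from the text.

```python
import math
from collections import defaultdict


def local_crop_distribution(parcel_xy, neighbours, radius_km=10.0, decimals=4):
    """Return {crop: share of cultivated area} around one parcel.

    `neighbours` holds (crop, area_ha, x_km, y_km) tuples for the previous
    year. Distances are measured between centroids (an assumption).
    """
    px, py = parcel_xy
    area_per_crop = defaultdict(float)
    for crop, area_ha, x_km, y_km in neighbours:
        if math.hypot(x_km - px, y_km - py) <= radius_km:
            area_per_crop[crop] += area_ha
    total = sum(area_per_crop.values())
    if total == 0.0:
        return {}
    # Rounding at 1e-4: very rare crops end up with a share of exactly 0.
    return {crop: round(area / total, decimals) for crop, area in area_per_crop.items()}


# Hypothetical neighbourhood, for illustration only.
neighbours = [
    ("maize", 6000.0, 1.0, 2.0),
    ("grassland", 3999.7, -3.0, 4.0),
    ("quinoa", 0.3, 0.5, 0.5),      # rare crop: about 0.00003 of the area
    ("potato", 500.0, 30.0, 0.0),   # farther than 10 km, ignored
]
d = local_crop_distribution((0.0, 0.0), neighbours)
print(d)
```

In this toy neighbourhood the rare crop receives a share of 0 after rounding, which is the effect the text describes for infrequent crops.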

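The window-level RS feature extraction of Section 2.2 (30-day windows, 15-day step, seven functionals per signal) can also be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the series is assumed to span 390 days so that exactly 25 windows fit, the standard deviation is taken as the population one, and the signals are synthetic.

```python
import math
import statistics

WINDOW_DAYS, STEP_DAYS, SAMPLING_DAYS = 30, 15, 2


def functionals(values):
    """Seven summary statistics of one signal over one window."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [
        statistics.mean(values),
        statistics.pstdev(values),   # population std (assumption)
        q1,
        statistics.median(values),
        q3,
        min(values),
        max(values),
    ]


def window_features(signals, n_days):
    """signals: dict name -> list of values, sample i taken at day 2*i.

    Returns one row per window: 4 signals x 7 functionals = 28 values.
    """
    rows = []
    start = 0
    while start + WINDOW_DAYS <= n_days:
        lo = math.ceil(start / SAMPLING_DAYS)                  # first sample in window
        hi = math.ceil((start + WINDOW_DAYS) / SAMPLING_DAYS)  # first sample after window
        row = []
        for name in sorted(signals):
            row.extend(functionals(signals[name][lo:hi]))
        rows.append(row)
        start += STEP_DAYS
    return rows


# Synthetic 2-day series for the four signals (illustrative values only).
names = ["B4", "B8A", "FAPAR", "LAI"]
signals = {n: [math.sin(i / 10 + k) for i in range(195)] for k, n in enumerate(names)}
features = window_features(signals, n_days=390)
print(len(features), len(features[0]), sum(len(r) for r in features))
```

With these assumptions the sketch reproduces the counts given in the text: 25 windows of 28 features, hence 700 features per year.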
3. Experiments and Results

In this section we describe the different experiments we ran with the different models, and their results. Because of the nature of our predictions, it can be useful to obtain them before the end of the farming season. In this context, we ran experiments using two setups at prediction time: an end-of-season configuration and an early-classification configuration. For the end-of-season configuration, we feed the neural network with all the RS data of the year, while in the early-classification configuration we stop at different dates of the year. As a point of comparison, we use an LSTM that processes the RS data and tags at the year level, treating each year independently. This year-independent model obtained state-of-the-art results according to [7] and is denoted as LSTM.

3.1. Experimental protocol

We trained all the networks via mini-batch stochastic gradient descent, using Adam as optimizer [32] with a learning rate of $10^{-3}$ and a cross-entropy loss function. The number of neurons of the crop embedding layer, of both RNN internal layers and of the fully connected RS layer, as well as the number of stacked LSTMs, were chosen using a hyperparameter search. The sizes of the two fully connected layers of Equation 7 are the same as that of the second RNN state h.

We trained our networks as for a sequence classification task, always with ten years of data. The labels from 2018 were used as training set, the labels from 2019 as development set and the labels from 2020 as test set. All results presented hereafter refer to the analysis of 2020 crop types and are based on models trained with the period 2009-2019, thus independent from the 2020 crop type observations. We zero-padded when no RS data were available (before 2016).

We perform data augmentation for the in-season classification model by randomly cropping the end of the time series for each batch, starting from mid-March. All models were coded using the PyTorch library [33].

3.2. Results

In this section we show the results for two different settings: the classical setting, where the network sees the whole year of RS signal, and a special early-season setting, where the RS signal of the current season stops before the end of the season. In order to deal with unbalanced classes, we used the unweighted (macro-averaged) F1, Precision and Recall as well as the Accuracy. We also used the micro-F1, which is the equivalent of the Accuracy when some classes have been removed (over all classes, the two coincide).

We also present results for 10 classes, which is the 12-class setting without grassland and other crops (see Figure 1).

3.2.1. End-of-Season Classification

The results of the end-of-season classification are available in Table 1. We tested different network configurations, using different kinds of features. The best results are obtained with our final model, which uses information from the crop rotations, the S2 time series and the crop distribution of the surrounding fields.

At first glance, we can see that the model using only the crop rotations can still reach an Accuracy of 73.3% for the 386-class problem, even though it does not use any information from the current year to make its prediction.

Our RS models reach high results on the 386-class problem (up to 78.7% with the HierbiLSTM model) because, contrary to most works, they also use RS data from past years on the same parcel, which makes it possible to model a temporal context. Interestingly, the Hierarchical setup with RS only reaches higher results on the 386-class configuration, going from an accuracy of 72.5 to 78.7 when compared to the LSTM.

Finally, the local crop distribution vector allows for a slight improvement, which is more visible in the 10-class configuration. However, it unexpectedly decreases the macro-F1 while increasing the Accuracy for the 386-class configuration. This can be interpreted as the model making more mistakes on infrequent crops precisely because it is globally better. An explanation can be that the infrequent crops are not all located in the same area, hence their distribution probability density is always approximated as 0.
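The relation between the metrics of Section 3.2 can be made concrete with a small sketch. It assumes that the 10-class micro-F1 is obtained by pooling true positives, false positives and false negatives over the retained classes only; this reading, the helper name `f1_scores` and the toy labels are assumptions for illustration, not the authors' evaluation code.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (macro_f1, micro_f1) computed over `labels` only."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            if t in tp:
                tp[t] += 1
        else:
            if p in fp:
                fp[p] += 1   # predicted class p wrongly
            if t in fn:
                fn[t] += 1   # missed true class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels, for illustration only.
y_true = ["wheat", "maize", "grassland", "wheat", "grassland", "potato"]
y_pred = ["wheat", "grassland", "grassland", "maize", "grassland", "potato"]

all_labels = ["wheat", "maize", "grassland", "potato"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_all, micro_all = f1_scores(y_true, y_pred, all_labels)

# Crops of interest only: grassland is removed from the label set.
macro_sub, micro_sub = f1_scores(y_true, y_pred, ["wheat", "maize", "potato"])
print(accuracy, micro_all, micro_sub, macro_sub)
```

Over all classes every error counts once as a false positive and once as a false negative, so the micro-F1 equals the Accuracy; once grassland is removed the two no longer coincide, which is why the micro-F1 is reported for the 10-class setting.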

3.2.2. Toward In-Season Classification

We saw earlier that the RS signal yields good end-of-season results, but performances are known to degrade strongly when classifying during the season [11]. In this case, the additional crop rotation modality can help.

For the in-season classification, we simply used our model trained over the whole year, with data stopping at a given point of the year. In Figure 2, we compare the model using the RS signal only with the multimodal model. It is important to note that we used the same "final" model in this noisy setup, without retraining it. The missing features, corresponding to unused months, were replaced by zeros. The results are thus preliminary and poor performances are to be expected. A straightforward option would be to train a new model for each of the evaluated months of the in-season classification.
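The zero-filling of unused months can be sketched as follows, together with the random end-cropping that the experimental protocol describes as data augmentation. The function names, the per-sample (rather than per-batch) cropping and the choice of window index 11 as "mid-March" (about 165 days after 1 October with a 15-day step) are assumptions for illustration.

```python
import random


def truncate_year(windows, n_keep):
    """Keep the first `n_keep` windows and replace the others by zeros."""
    width = len(windows[0])
    return [list(w) if i < n_keep else [0.0] * width for i, w in enumerate(windows)]


def random_end_crop(windows, min_keep, rng):
    """Augmentation: cut the series at a random window index >= `min_keep`."""
    n_keep = rng.randint(min_keep, len(windows))  # both bounds inclusive
    return truncate_year(windows, n_keep), n_keep


# One year of RS features: 25 windows x 28 features (dummy values).
year = [[1.0] * 28 for _ in range(25)]

# In-season evaluation: only the first 12 windows are available.
in_season = truncate_year(year, 12)

# Training-time augmentation, never cutting before window 11 (assumed mid-March).
augmented, n_keep = random_end_crop(year, min_keep=11, rng=random.Random(0))
print(len(in_season), n_keep)
```

The sequence length stays at 25 in both cases, so the same network can consume truncated and complete years.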

The multimodal model always outperforms the RS model, which is expected, especially at the beginning of the season when almost no information is available from the RS modality. It is also interesting to note that the performances go below those of the unimodal crop model. This is certainly because the models may give too much attention to the RS modality compared to the other ones, since the RS modality has a higher impact on the performance as the season progresses. An option to counter this effect would be to use a gate that discards a noisy modality, as shown in [34, 35].

[Table: columns Labels, Thresh (.9)]

4. Analysis

For the sake of clarity, all analyses presented hereafter in this section are limited to a set of crops of interest, corresponding to the 10-class setting. Details on the crops are provided in Figure 1.

High Precision examples. We present the results of our model on a smaller set of parcels for which the precision is higher than average. In the perspective of crop monitoring, this analysis can be very valuable. Even if not 100% of the parcels are included, the output might support crop yield forecasting systems, through the analysis of the crop-specific RS time series with highest probability.

We take the examples that are classified with a probability above 0.9 and compute metrics over them. Those examples represent a large part of the dataset: more than 536k for the 12-class dataset and more than 148k for the 10-class dataset, representing respectively 90.0% of all the parcels and 76.5% of the parcels containing crops of interest. The results are shown in Table 2.

In-Season Classification. We compare the vanilla model with the in-season classification model trained with our data-augmentation technique. The vanilla model has only seen end-of-season examples during training, so it is expected to perform worse when used in-season. This explains the decrease in performance compared to a model only taking the crops into account.

The data augmentation used for the in-season models surprisingly does not work with the RS-only model, but allows the multimodal model to surpass the crop-only model in April.

5. Conclusion and Future works

We presented an innovative study to produce in-season crop mapping without relying on in-situ data of the current season. The approach relies on the analysis of several modalities, including the crop rotation of the previous years, the Sentinel-2 time series of the previous and current years, as well as the previous year's local crop distribution in the neighboring parcels. A deep learning algorithm was used to model all those modalities at different levels using a Hierarchical LSTM model. Firstly, we modeled the RS data with a Bidirectional LSTM with Attention, using a sliding window on the satellite signals and integrating them using statistical functionals, as can be done for speech. Secondly, we fed the representation into another LSTM network modeling the crops as words and their rotation as a sentence, as can be done with a language model. Finally, we added a context vector on the last layer in order to add information about the geographical location of the parcel. The designed methodology was tested over cropland of the Netherlands, benefiting from 12 years of crop rotation data nationwide. More generally, our method outperforms by a large margin the classical state-of-the-art using only an RNN or a Transformer to model the EO data at the level of a year.

Nevertheless, there is still a lot of room for future work. Adding more spectral bands to the EO data could improve the performances. The multimodality could be better modeled, at the level of EO data using multimodal aligned or non-aligned time-series fusion models [36, 37], and at a higher level between static representations [34]. Finally, our model is impossible to adapt to an unknown place where the crop rotations are not available; a domain adaptation method using few-shot learning could be useful in this case [38].

France (2017).

[22] F. Baret, O. Hagolle, B. Geiger, P. Bicheron, B. Miras, M. Huc, B. Berthelot, F. Niño, M. Weiss, O. Samain, et al., LAI, fAPAR and fCover CYCLOPES global products derived from VEGETATION: Part 1: Principles of the algorithm, Remote Sensing of Environment 110 (2007) 275–286.

[23] M. Claverie, E. F. Vermote, M. Weiss, F. Baret, O. Hagolle, V. Demarez, Validation of coarse spatial resolution LAI and FAPAR time series over cropland in southwest France, Remote Sensing of Environment 139 (2013) 216–230.

[24] M. Claverie, J. Ju, J. G. Masek, J. L. Dungan, E. F. Vermote, J.-C. Roger, S. V. Skakun, C. Justice, The Harmonized Landsat and Sentinel-2 surface reflectance data set, Remote Sensing of Environment 219 (2018) 145–161.

[25] R. K. Pearson, Outliers in process modeling and identification, IEEE Transactions on Control Systems Technology 10 (2002) 55–63.

[26] P. H. Eilers, V. Pesendorfer, R. Bonifacio, Automatic smoothing of remote sensing data, in: 2017 9th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), IEEE, 2017, pp. 1–3.

[27] B. Schuller, S. Steidl, A. Batliner, J. Hirschberg, J. K. Burgoon, A. Baird, A. Elkins, Y. Zhang, E. Coutinho, K. Evanini, The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language, in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2016.

[28] F. A. Merlos, R. J. Hijmans, The scale dependency of spatial crop species diversity and its relation to temporal diversity, Proceedings of the National Academy of Sciences 117 (2020) 26176–26182.

[29] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, J. Pineau, Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models (2015). URL: http://arxiv.org/abs/1507.04808. arXiv:1507.04808.

[30] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, Y. Bengio, End-to-end attention-based large vocabulary speech recognition, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 2016-May (2016) 4945–4949. doi:10.1109/ICASSP.2016.7472618. arXiv:1508.04395.

[31] V. Sanh, T. Wolf, S. Ruder, A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks, in: AAAI, 2018. arXiv:1811.06031v1.

[32] D. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, International Conference on Learning Representations (2014) 1–13. URL: http://arxiv.org/abs/1412.6980. arXiv:1412.6980.

[33] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.

[34] J. Arevalo, T. Solorio, M. Montes-y-Gómez, F. A. González, Gated multimodal units for information fusion, in: 5th International Conference on Learning Representations, ICLR 2017 - Workshop Track Proceedings, 2017. arXiv:1702.01992.

[35] M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, L.-P. Morency, Multimodal sentiment analysis with word-level fusion and reinforcement learning, in: Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017, pp. 163–171.

[36] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, L.-P. Morency, Memory Fusion Network for Multi-view Sequential Learning, in: AAAI, 2018. arXiv:1802.00927v1.

[37] J. Yang, Y. Wang, R. Yi, Y. Zhu, A. Rehman, A. Zadeh, S. Poria, L.-P. Morency, Unaligned Human Multimodal Language Sequences (2020). arXiv:2010.11985v1.

[38] M. Rußwurm, S. Wang, M. Körner, D. Lobell, Meta-Learning for Few-Shot Land Cover Classification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.