A neural network strategy for supervised classification via the Learning Under Privileged Information paradigm

Ludovica Sacco1, Dino Ienco2,4 and Roberto Interdonato3,4
1 DIMES - University of Calabria, P. Bucci 41C, 87036 Rende (CS), Italy
2 INRAE, UMR TETIS, Univ. Montpellier, Montpellier, France
3 CIRAD, UMR TETIS, Montpellier, France
4 TETIS, Univ. of Montpellier, APT, Cirad, CNRS, INRAE, Montpellier, France

Abstract

Devising new methodologies to handle and analyse Big Data has become a fundamental task in our increasingly service-oriented and interconnected society. One of the problems arising while handling such data is that, for a given set of entities, not all the entities may be described at the same level of detail, i.e., the number of features describing each entity may vary. In general, applying classic data science methods requires a common feature set over the whole dataset, which corresponds to the set of features shared by all entities, resulting in a loss of information for the entities for which additional information may be available. In order to exploit such additional information, the Learning Using Privileged Information (LUPI) paradigm has been proposed, based on the introduction of a teacher role in the learning process. In this schema the teacher acquires a strategic position, by exploiting at the training stage some additional privileged information about the entities, which will not be available at the test stage. In this work, we apply this paradigm in the context of neural networks, by proposing a LUPI-based deep learning architecture able to exploit a larger set of attributes at training time, with the aim of improving classification performance on a set of entities associated with a reduced attribute set. Experimental results show how the proposed approach improves upon those applying the same schema to classic machine learning methods (e.g., SVM).

Keywords
Big Data, LUPI paradigm, Neural Networks

1.
Introduction

The developments achieved in recent years in machine learning, due both to theoretical studies and to the constant increase in the computational power of processors, have led to several advancements over traditional learning techniques. These years have also seen a sudden increase in the quantity of available data: massive quantities of information are produced, collected and digitally stored every day, in contexts that touch different domains and aspects of human life (human health and behavior, agriculture, biology, and so on). The availability of these so-called big data, together with that of machines that can handle higher volumes of data than ever before, has also led to an increase in the use of supervised learning techniques, i.e., techniques that rely on labeled data in order to train a machine learning model. However, in practical contexts, these data are generally noisy, and may include a diversity of features that are not always easy to exploit for computational analyses. A typical scenario is that in which entities of the same type are associated with a non-homogeneous set of attributes that describe them. When using traditional machine learning methods, this requires reducing the set of attributes exploited to learn the model to just the common ones, thus resulting in a significant loss of information.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
l.sacco@dimes.unical.it (L. Sacco); dino.ienco@inrae.fr (D. Ienco); roberto.interdonato@cirad.fr (R. Interdonato)
ORCID: 0000-0003-0235-6273 (L. Sacco); 0000-0002-8736-3132 (D. Ienco); 0000-0002-0536-6277 (R. Interdonato)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
One of the solutions that have been proposed in order to overcome this problem, and to allow the use of such diverse sets of attributes, is the Learning Using Privileged Information (LUPI) paradigm [1]. The idea behind LUPI is to introduce a teacher-student model in the learning process, which allows training a model on a larger set of attributes than the one available at testing time. More specifically, the idea is to select a subset of entities for which a larger set of attributes is available (i.e., the privileged information), and then to train a model that can use such information to discern among entities that are associated with a reduced set of attributes (i.e., entities for which the privileged information is not available, but just a limited set of regular attributes). The original implementations of the LUPI paradigm were introduced on classic machine learning techniques, e.g., Support Vector Machines (SVM) [1]. However, this paradigm can also be integrated into more advanced frameworks, such as those based on deep learning architectures [2], which are nowadays the state of the art in several domains. In this paper, we present a LUPI-based deep learning approach built on a three-branch architecture where a first model (M1) is trained on the full set of attributes (regular and privileged information), a second one (M2) exploits the representation learnt by M1 in order to stretch the representation learnt from regular attributes towards the one obtained from privileged ones, and a third model (M3) is optimized on the regular attributes only. The final classification is then obtained by combining the outputs of M2 and M3. While the proposed architecture is implemented by using a classic Multi-Layer Perceptron to instantiate the three models, it is general enough to be extended to different neural network models (e.g., Convolutional Neural Networks and Recurrent Neural Networks), and to be effectively applied to data coming from different domains.
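As a rough sketch of how the three branches could be wired together in TensorFlow/Keras (the hidden layer sizes follow the MLP setup described in our experiments; the feature dimensions and the averaging operator used to combine M2 and M3 are illustrative assumptions, not the exact implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def make_mlp(input_dim, n_classes, name):
    """One MLP branch; hidden sizes (256, 192, 128, 64) follow the experimental setup."""
    inp = layers.Input(shape=(input_dim,))
    x = inp
    for units in (256, 192, 128, 64):
        x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inp, out, name=name)

# Dimensions below are illustrative placeholders.
n_regular, n_privileged, n_classes = 30, 10, 3

# M1: teacher, trained on regular + privileged attributes.
m1 = make_mlp(n_regular + n_privileged, n_classes, "M1")

# M2 and M3: regular attributes only, run in parallel at test time.
m2 = make_mlp(n_regular, n_classes, "M2")
m3 = make_mlp(n_regular, n_classes, "M3")

# Final prediction combines the outputs of M2 and M3 (averaging is one
# plausible combination; the combination operator is an assumption here).
x_reg = layers.Input(shape=(n_regular,))
combined = layers.Average()([m2(x_reg), m3(x_reg)])
student = Model(x_reg, combined, name="student")
```

At test time only `student` is used, since the privileged attributes feeding M1 are no longer available.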
Experimental evaluation on six real-world datasets shows how the proposed methodology improves upon a competitor that applies the LUPI paradigm to an SVM approach (SVM+). The paper is organized as follows: Section 2 discusses the related work, Section 3 introduces the proposed architecture, Section 4 discusses the experimental results, while Section 5 concludes the work.

2. Related Work

The first algorithm implementing the LUPI paradigm is an extension of the classic SVM algorithm, known as SVM+ [1, 3]. SVM+ proposes a modified formulation of SVM in which the misclassification of the training examples is evaluated through slack variables estimated by a correcting function learned from the privileged information. Subsequently, different improvements and alternatives have been proposed for supervised learning problems. In [4] it has been remarked that the LUPI paradigm had always been implemented with the L-2 support vector machine, with the result that the number of tuning parameters is doubled, with a consequent increase in computational cost. The decision to employ the L-1 norm, widely used for feature and variable selection [5], thus led to the creation of the L-1 SVM and its use in LUPI. Improvements to the SVM+ optimization problem are provided by two new SMO-style algorithms [6]: aSMO and gSMO. When large training sets are used, both prove more efficient, in terms of the generalization error/running time trade-off, than the generic optimizer LOQO, which is based on interior point methods. Further similar algorithms can be found in [7], together with a comparison between aSMO and caSMO. Considerations on differentiating easy examples from difficult ones are discussed and formalized in [8], where a new approach for choosing the weights associated with every training example is presented.
More recent works [9, 10] discuss the importance of the knowledge transfer concept in SVM+ and neural networks, used to improve their convergence properties and therefore accelerate the student's learning. Privileged information plays an important role in domains like computer vision, where several studies have successfully introduced the LUPI paradigm [11, 12]. Other applications use bounding-boxes, image captions and descriptions as additional information, combining them with Rank Transfer techniques to overcome the limitations of SVM+ [13]. In recent years, other studies have used privileged information in combination with neural networks to solve various tasks. Classifying fine-grained images, or recognizing images from a dataset containing annotation noise, is possible thanks to the DeepLUPI framework [14]. In [15] a two-stream fully convolutional network, named MIML-FCN+, is proposed to solve the multi-instance multi-label problem by using privileged bags of labels. A LUPI framework for CNNs and RNNs is offered in [16], where a heteroscedastic dropout is used and the privileged information determines the variance of the dropout. In this work, while we exploit a simple Multi-Layer Perceptron to instantiate the models at the base of our architecture, we propose a deep learning framework that is general enough to be extended to different neural network models (e.g., Convolutional Neural Networks and Recurrent Neural Networks), and to be effectively applied to data coming from different domains.

3. The LUPI Architecture

3.1. Learning Using Privileged Information Paradigm

In the classic machine learning paradigm, the role that a teacher plays in the human learning process has been underestimated. The supervised learning task follows a simple strategy.
Given a set of pairs (𝑥1, 𝑦1), ..., (𝑥𝑙, 𝑦𝑙), 𝑥𝑖 ∈ 𝑋, 𝑦𝑖 ∈ {−1, 1}, where the vector 𝑥𝑖 ∈ 𝑋 is the description of the example and 𝑦𝑖 its classification, the goal is to find the function 𝑦 = 𝑓(𝑥, 𝛼*) that minimizes the probability of incorrect classifications. In the LUPI paradigm the goal is exactly the same, but it contemplates the knowledge provided by the teacher at the training stage, the so-called privileged information. Triplets (𝑥1, 𝑥*1, 𝑦1), ..., (𝑥𝑙, 𝑥*𝑙, 𝑦𝑙), 𝑥𝑖 ∈ 𝑋, 𝑥*𝑖 ∈ 𝑋*, 𝑦𝑖 ∈ {−1, 1} are now considered instead of pairs, each one independently generated by some underlying unknown distribution 𝑃(𝑥, 𝑥*, 𝑦).

3.2. Implementation

The proposed LUPI-based deep learning architecture (Figure 1) is an end-to-end framework consisting of three neural networks (e.g., multi-layer perceptrons) that process the input data to produce the final classification.

Figure 1: The general overview of the proposed LUPI based architecture. The privileged information is identified by 𝑋* and the regular training examples by 𝑋.

The architecture has three different streams, two of which (M2 and M3) run in parallel, and their outputs are combined to get the final classification. The first stream analyzes the training examples 𝑋 (i.e., the entities with the associated regular information) with the addition of the privileged information 𝑋*, while the other two streams work only on the training examples 𝑋. This choice allows exploiting the knowledge acquired when the privileged information is available. In particular, the distribution over the labels output by the first model (M1) is channeled into the second neural network. In this configuration, the second model (M2) uses as input a set of pairs (𝑥𝑖, 𝑦𝑝), where 𝑥𝑖 ∈ 𝑋 and 𝑦𝑝 is the label distribution output by M1. Moreover, to train the second neural network we choose the Kullback-Leibler divergence [17], so that M2 behaves as similarly as possible to M1.
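To illustrate the role of the Kullback-Leibler divergence in this step, a minimal NumPy sketch (the prediction vectors below are hypothetical soft outputs, not values from our experiments):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between discrete distributions: teacher output p, student output q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical soft predictions over three classes.
teacher = np.array([0.7, 0.2, 0.1])  # output distribution of M1
student = np.array([0.6, 0.3, 0.1])  # output distribution of M2

loss = kl_divergence(teacher, student)
# Minimizing this loss over the training set pushes M2's output
# distribution towards that of the privileged-information model M1.
```

The divergence is zero only when the two distributions coincide, which is exactly the mimicking behavior the architecture asks of M2.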
Finally, the third stream manages the training examples without any additional information to help during the classification. It should be noted that the M1 model is executed separately from the other two, and that M2 and M3 are run in parallel, as part of a single multi-output neural network. Regarding the loss functions, categorical cross entropy has been used, except for M2, where the Kullback-Leibler divergence is employed with the aim of mimicking the behavior of the M1 model trained with the privileged information.

4. Experiments

This section reports the experimental setup, the results achieved and the data used to evaluate our architecture. To evaluate the proposed LUPI architecture, we carried out experiments on six datasets from different domains. We describe how we preprocessed them for use in the experimental phase, and introduce setup details for both our implementation and the competitors, specifying the metrics adopted for the comparison. Finally, we discuss the results, aiming to validate our architecture.

4.1. Datasets

Six well-known datasets were used in the experiments:

• COIL20 [18]: image dataset including 20 different objects. For each of them, 72 images of size 128×128 are included. The background has been discarded.
• HAR [19]: sensor data collected through a smartphone from the recordings of 30 subjects performing activities of daily living, covering six different activities: walking, walking upstairs, walking downstairs, sitting, standing and laying.
• Isolet [20]: 150 subjects spoke the name of each letter of the alphabet twice. Hence, there are 52 training examples from each speaker.
• landsat [21]: dataset of satellite images of agricultural land in Australia, used to classify seven different soil classes constituting dissimilar soil types.
• USPS [22]: 9298 images belonging to 10 different classes, with size 16×16.
• waveform [23]: a synthetic generator producing 3 classes of waves.
Each class is generated from a combination of 2 to 3 base waves. The structural characteristics of these datasets are summarized in Table 1.

Table 1: Datasets used to perform the experiments.

Dataset    # Features  # Objects  # Classes  Type
COIL20     1024        1440       20         Image
HAR        561         10299      6          Sensor
Isolet     617         1560       26         Speech
landsat    36          2859       7          Image
USPS       256         9298       10         Image
waveform   40          5000       3          Numbers

4.2. Setup

In order to evaluate the significance of the proposed approach, we provide an experimental evaluation including tests on different percentages of privileged information and a comparison with the performance of the SVM+ algorithm [1, 3]. For training and evaluation purposes, the datasets were divided into three parts: training, validation and test set, representing respectively 50%, 20% and 30% of the instances (i.e., rows of the dataset). In line with the recent deep learning literature, the model obtaining the best performance on the validation set has been used for the evaluation on the test set. The privileged information is provided by selecting a portion of the available features of each dataset, i.e., a certain number of columns. In detail, the feature indices are randomly shuffled and then 9 different percentages of attributes are taken into account: 5%, 10%, 25%, 30%, 50%, 60%, 70%, 80%, 90%. The indices are selected in an incremental fashion, so that the higher percentages contain the indices included in the lower ones. Bear in mind that the percentages of privileged and "regular" information are complementary, so that for a certain percentage 𝑥 of privileged information, the number of attributes available at testing time will be (100 − 𝑥)% of the total attributes. In our setting, this means that higher percentages of privileged information will correspond to harder test stages, because lower quantities of information will be available at testing time.
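The incremental index-selection scheme described above can be sketched as follows (the function name and the fixed seed are our own illustrative choices; the actual experiments repeat the shuffle 10 times):

```python
import numpy as np

def split_privileged(n_features, pct, seed=0):
    """Shuffle the feature indices once, then take the first pct% as privileged.

    Because the prefix is shared across percentages, every higher percentage
    contains the indices selected at the lower ones (incremental selection).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_features)
    k = int(n_features * pct / 100)
    return order[:k], order[k:]  # (privileged, regular)

# e.g., the waveform dataset has 40 features; with 25% privileged
# information, 10 attributes are privileged and 30 remain regular.
priv25, reg25 = split_privileged(40, 25)
priv50, _ = split_privileged(40, 50)  # same seed -> shared prefix
```

With the same seed, the 25% privileged indices are by construction a subset of the 50% ones, matching the incremental property stated above.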
Obviously this assumption is only related to our need to "artificially" partition the available attributes into privileged and regular ones, and does not hold in real-world cases, where higher quantities of privileged information should correspond to better performances. In order to avoid the bias induced by this selection process, these operations are executed 10 times, in order to get 10 different configurations of privileged indices for each considered percentage of privileged information. Concerning the deep learning parameters, the number of epochs is set to 100 and the learning rate of the Adam optimizer [24] is set to 1 × 10−4. The metrics used to evaluate our architecture are Accuracy and F-measure. Regarding the internal structure of the LUPI architecture, the three models M1, M2 and M3 are built using a multi-layer perceptron (MLP) neural network. Each MLP has an input layer, three internal layers and an output layer; the first four layers have respectively 256, 192, 128 and 64 neural units, while the output layer has as many units as the number of classes of each dataset. As the activation function, the Rectified Linear Unit (ReLU) is employed for each layer, apart from the output one, which uses the SoftMax function to normalize the output vector values so that their sum is equal to 1. The experiments are run 30 times for each configuration of the different privileged indices, then mean and standard deviation are taken into account to evaluate the results. The reported experiments are carried out on the following platform: Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz with 8 GB of RAM. The LUPI-based deep learning architecture is implemented in Python with the TensorFlow 2.0 library. The source code for SVM+ is written in C and is available online¹.

4.3. Experimental Results

Table 2 and Table 3 report the average results in terms of Accuracy and F-Measure for the 6 different datasets.
It can be noted that the proposed approach achieves better performances for percentages of privileged information lower than or equal to 30%. This result is actually not surprising since, as explained in Section 4.2, higher percentages of privileged information correspond to harder test stages (i.e., lower quantities of information are available at testing time). For higher percentages of privileged information, performances are slightly lower, but still comparable for most datasets, showing the robustness of the proposed approach even in the cases where only 10% of the total attributes are available at testing time (i.e., 90% of privileged information).

¹ https://github.com/cypw/svm-plus

Table 2: Average and standard deviation of the accuracy performance of the proposed LUPI based architecture as the percentage of privileged information in the dataset (columns in the table) increases.

           5%          10%         25%         30%         50%         60%         70%         80%         90%
COIL20     98.17±0.44  98.19±0.44  98.18±0.50  98.21±0.43  98.19±0.49  98.14±0.45  98.05±0.48  98.06±0.49  97.31±0.46
HAR        94.15±0.87  94.08±0.86  93.74±0.89  93.65±0.70  93.23±0.71  92.66±0.83  92.21±0.79  91.05±0.79  86.44±0.87
Isolet     91.60±0.86  91.51±0.80  91.08±0.75  91.02±0.82  90.11±0.84  89.43±0.87  88.30±0.92  86.81±0.93  81.01±1.21
landsat    88.82±0.94  88.60±0.93  88.14±0.90  88.02±0.95  86.97±0.97  86.41±0.96  85.57±1.01  84.78±0.96  79.34±0.85
USPS       96.49±0.26  96.52±0.23  96.44±0.20  96.42±0.23  96.34±0.19  96.21±0.23  95.87±0.21  94.79±0.26  91.72±0.34
waveform   85.66±0.65  85.36±0.63  83.54±0.63  82.95±0.64  79.85±0.63  76.95±0.55  73.94±0.57  68.91±0.53  54.50±0.65

Table 3: Average and standard deviation of the F-Measure performance of the proposed LUPI based architecture as the percentage of privileged information in the dataset (columns in the table) increases.
           5%          10%         25%         30%         50%         60%         70%         80%         90%
COIL20     98.22±0.49  98.26±0.49  98.22±0.55  98.27±0.50  98.25±0.53  98.19±0.50  98.14±0.52  98.13±0.53  97.39±0.53
HAR        94.02±0.95  93.97±0.92  93.65±0.95  93.54±0.77  93.08±0.78  92.57±0.88  92.09±0.85  90.95±0.89  86.37±0.95
Isolet     91.90±0.92  91.83±0.82  91.40±0.81  91.35±0.85  90.45±0.94  89.82±0.92  88.67±1.00  87.10±0.96  81.34±1.27
landsat    86.51±1.48  86.15±1.52  85.56±1.45  85.37±1.58  83.79±1.63  82.99±1.65  81.63±1.88  80.29±1.91  73.70±1.56
USPS       96.04±0.31  96.08±0.31  96.02±0.18  96.03±0.22  95.92±0.22  95.88±0.24  95.54±0.36  94.32±0.33  90.89±0.44
waveform   85.63±0.70  85.34±0.68  83.48±0.67  82.91±0.70  79.77±0.65  76.83±0.60  73.64±0.70  67.63±0.87  51.07±1.61

Actually, the only dataset showing a significant degradation in performance for higher percentages of privileged information is waveform, which is also the one with the lowest performances for all configurations. A reason for this behavior may be found in its synthetic nature. On the contrary, the dataset that produces the best results is COIL20, which is the dataset including the highest number of features together with a relatively low number of objects. This probably allows a better classification, thanks to the large amount of information available at training time over a relatively small number of tuples. As a general observation, all the other datasets show good Accuracy and F-Measure results, confirming the effectiveness of the proposed approach on datasets coming from different application domains. Table 4 shows the average Accuracy results for the SVM+ algorithm. It can be noted how in most cases the performances begin to degrade already for a percentage of privileged information around 10%, making SVM+

Table 4: Average accuracy performance of SVM+ as the percentage of privileged information in the dataset (columns in the table) increases.
           5%     10%    25%    30%    50%    60%    70%    80%    90%
COIL20     96.70  96.17  96.16  96.14  96.23  96.28  96.33  95.80  95.03
HAR        94.15  93.55  92.97  92.58  91.65  90.86  90.27  88.43  81.63
Isolet     91.30  90.79  90.45  90.30  89.72  89.02  88.18  86.71  81.13
landsat    85.73  85.19  84.65  84.92  84.74  84.02  83.61  81.42  73.27
USPS       96.38  96.25  96.07  96.01  95.81  95.67  95.26  93.90  89.71
waveform   82.96  82.95  81.97  81.40  77.81  72.77  69.97  63.79  48.81

significantly less robust than our approach. To ease the comparison between the two approaches, in Figure 2 we present a visual comparison between our architecture and SVM+ in terms of Accuracy. It is evident how the LUPI architecture clearly outperforms SVM+ on most datasets. The only two datasets in which its performances are not clearly superior (though still comparable) are USPS and Isolet.

Figure 2: Accuracy comparison between our LUPI architecture and the SVM+ algorithm.

The scenario described so far reflects the Teacher-Student metaphor on which the LUPI paradigm is based: too high a percentage of privileged information does not allow the student to develop the skills needed to adequately face the testing phase, while percentages of up to 30% represent the right measure for the student to develop its own method and acquire enough knowledge to build precise classification models. Our architecture proves able to fully perform this task, providing a valid structure for optimal learning up to 30% of privileged information, while also improving the performance in terms of Accuracy compared to the SVM+ algorithm.

5. Conclusions

In this work, we introduced a new deep learning architecture based on the Learning Using Privileged Information (LUPI) paradigm. The LUPI paradigm makes it possible to exploit extra information that may be available only for a subset of the total objects in a dataset, i.e., the so-called privileged information.
More specifically, a model is trained by exploiting such privileged information, and is then able to classify test examples associated with regular information only. Experimental performances on six well-known datasets show the significance of our approach, proving its robustness to different available quantities of privileged and regular information. Experimental results also show how the proposed deep learning architecture outperforms a competitor integrating the LUPI paradigm into a classic machine learning method (SVM+). While in this work, in order to test the significance of our approach, we resorted to an artificial partitioning of the available information between privileged and regular, in the future we plan to test the architecture on real-world case studies, corresponding to specific application domains. A first example can be that of remote sensing, where the availability of different sensors on specific areas may be exploited as privileged information (e.g., exploiting drone images covering a limited geographical area, combined with satellite images available on larger portions of the scene). From a methodological point of view, we also plan to extend our approach by instantiating the three models included in the architecture with different neural networks (e.g., Recurrent and Convolutional ones).

6. Acknowledgments

This work was supported by the Regione Calabria, as part of the Ph.D. scholarship of Ludovica Sacco.

References

[1] V. Vapnik, A. Vashist, A new learning paradigm: Learning using privileged information, Neural Networks 22 (2009) 544–557. doi:10.1016/j.neunet.2009.06.042.
[2] Y. LeCun, Y. Bengio, G. E. Hinton, Deep learning, Nature 521 (2015) 436–444. URL: https://doi.org/10.1038/nature14539. doi:10.1038/nature14539.
[3] V. Vapnik, A. Vashist, N. Pavlovitch, Learning using hidden information (learning with teacher), in: Proceedings of the 2009 International Joint Conference on Neural Networks, IJCNN'09, IEEE Press, 2009, pp. 1252–1259.
[4] L.
Niu, Y. Shi, J. Wu, Learning using privileged information with l-1 support vector machine, in: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, IEEE, 2012. doi:10.1109/wi-iat.2012.52.
[5] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer New York, 2009. doi:10.1007/978-0-387-84858-7.
[6] D. Pechyony, R. Izmailov, A. Vashist, V. Vapnik, SMO-style algorithms for learning using privileged information, 2010, pp. 235–241.
[7] D. Pechyony, V. Vapnik, Fast optimization algorithms for solving SVM+, 2011, pp. 27–42. doi:10.1201/b11429-5.
[8] M. Lapin, M. Hein, B. Schiele, Learning using privileged information: SVM and weighted SVM, Neural Networks 53 (2014) 95–108. doi:10.1016/j.neunet.2014.02.002.
[9] V. Vapnik, R. Izmailov, Learning using privileged information: Similarity control and knowledge transfer, Journal of Machine Learning Research 16 (2015) 2023–2049. URL: http://jmlr.org/papers/v16/vapnik15b.html.
[10] V. Vapnik, R. Izmailov, Knowledge transfer in SVM and neural networks, Annals of Mathematics and Artificial Intelligence 81 (2017) 3–19. doi:10.1007/s10472-017-9538-x.
[11] J. Feyereisl, S. Kwak, J. Son, B. Han, Object localization based on structural SVM using privileged information, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/file/6c4b761a28b734fe93831e3fb400ce87-Paper.pdf.
[12] W. Li, L. Niu, D. Xu, Exploiting privileged information from web data for image categorization, in: Computer Vision – ECCV 2014, Springer International Publishing, 2014, pp. 437–452. doi:10.1007/978-3-319-10602-1_29.
[13] V. Sharmanska, N. Quadrianto, C. H. Lampert, Learning to rank using privileged information, in: 2013 IEEE International Conference on Computer Vision, IEEE, 2013. doi:10.1109/iccv.2013.107.
[14] M.
Chevalier, N. Thome, G. Hénaff, M. Cord, Classifying low-resolution images by integrating privileged information in deep CNNs, Pattern Recognition Letters 116 (2018) 29–35. doi:10.1016/j.patrec.2018.09.007.
[15] H. Yang, J. Zhou, J. Cai, Y. Ong, MIML-FCN+: Multi-instance multi-label learning via fully convolutional networks with privileged information, 2017, pp. 5996–6004. doi:10.1109/CVPR.2017.635.
[16] J. Lambert, O. Sener, S. Savarese, Deep learning under privileged information using heteroscedastic dropout (2018). arXiv:1805.11614.
[17] S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1951) 79–86. doi:10.1214/aoms/1177729694.
[18] S. A. Nene, S. K. Nayar, H. Murase, Columbia University image library (COIL-20) (1996). URL: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php.
[19] J. L. Reyes-Ortiz, D. Anguita, A. Ghio, L. Oneto, X. Parra, A public domain dataset for human activity recognition using smartphones, 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2013).
[20] R. Cole, M. Fanty, Spoken letter recognition, in: Advances in Neural Information Processing Systems (1991) 220–226.
[21] Statlog landsat satellite data set, UCI Machine Learning Repository, Univ. California, Irvine, Irvine, CA, USA (1993).
[22] J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (1994) 550–554. doi:10.1109/34.291440.
[23] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees (1984) 49–55.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.