Cost Aware Feature Elicitation

Srijita Das, The University of Texas at Dallas, Srijita.Das@utdallas.edu
Rishabh Iyer, The University of Texas at Dallas, Rishabh.Iyer@utdallas.edu
Sriraam Natarajan, The University of Texas at Dallas, Sriraam.Natarajan@utdallas.edu

ABSTRACT
Motivated by clinical tasks where acquiring certain features, such as FMRI scans or blood tests, can be expensive, we address the problem of test-time elicitation of features. We formulate cost-aware feature elicitation as an optimization problem that trades off performance against feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the effectiveness of the proposed approach in minimizing costs while maximizing performance.

CCS CONCEPTS
• Supervised learning → Budgeted learning; Feature selection; • Applications → Healthcare.

KEYWORDS
cost sensitive learning, supervised learning, classification

ACM Reference Format:
Srijita Das, Rishabh Iyer, and Sriraam Natarajan. 2020. Cost Aware Feature Elicitation. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020), San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). © 2020 Copyright held by the author(s).

1 INTRODUCTION
In the supervised classification setting, every instance has a fixed-length feature vector, and a discriminative function is learnt on this feature vector and its corresponding class variable. However, many practical problems, such as healthcare, network domains and survey questionnaire design [19, 20], have an associated feature acquisition cost. In such domains there is a cost budget, and acquiring all the features of an instance can be very costly. As a result, many cost-sensitive classifier models [2, 8, 24] have been proposed in the literature to incorporate the cost of acquisition into the model objective during training and prediction.

Our problem is motivated by such a cost-aware setting, where the assumption is that prediction-time features have an acquisition cost and must adhere to a strict budget. Consider a patient visiting a doctor for a potential diagnosis of a disease. For such a patient, information like age, gender, ethnicity and other demographic features is easily available at zero cost. However, the various lab tests that the patient needs to undergo incur costs. A trained model should therefore be able to identify the most relevant (i.e., most informative, yet least costly) lab tests required for each specific patient. The intuition of this work is that different patients, depending on their history, ethnicity, age and gender, may require different tests for reasonably accurate prediction. We build on the intuition that, given certain observed features such as one's demographic details, the most important features for a patient depend on the important features of similar patients. Based on this intuition, we find similar data points in the observed feature space and identify the important feature subsets of these similar instances by employing a greedy information-theoretic feature selector objective.

Our contributions in this work are as follows: (1) we formalize the problem as a joint optimization problem of selecting the best feature subset for similar data points and optimizing the loss function using the important feature subsets; (2) we account for acquisition cost in both the feature selector objective and the classifier objective to balance the trade-off between acquisition cost and model performance; (3) we empirically demonstrate the effectiveness of the proposed approach on three real-world medical data sets.
2 RELATED WORK
The related work on cost-sensitive feature selection and learning can be categorized into the following four broad approaches.

Tree based budgeted learning: Prediction-time elicitation of features under a cost budget has been widely studied in the literature. Much work has been done on tree-based models [5, 16, 17, 26–28] by adding a cost term to the tree objective function, in either decision trees or ensemble methods like gradient boosted trees. All these methods aim to build an adaptive and complex decision boundary by considering the trade-off between performance and test-time feature acquisition cost. While we are similar in motivation to these approaches, our methodology is different in that we do not use tree-based models. Instead, our approach finds local feature subsets using an information-theoretic feature selector for different clusters of training instances built in a lower-dimensional space.

Adaptive classification and dynamic feature discovery: Our work also draws inspiration from Nan et al.'s work [15], where they learn a high-performance costly model and approximate its performance adaptively by building a low-cost model and a gating function that decides which model to use for specific instances. This adaptive switching between the low-cost and high-cost models handles the trade-off between cost and performance. Our method differs from theirs because we do not maintain a high-cost model, which is costly to build and difficult to decide on; instead, we refine the parameters of a single low-cost model by incorporating a cost penalty in the feature selector and model objectives. Our work is also along the direction of Nan et al.'s work [18], where they select varying feature subsets for test instances using neighbourhood information from the training data. While calculating neighborhood information from training data is similar to building clusters in our approach, the training neighborhood in our method is computed on just the observed feature space. Moreover, we incorporate the neighbourhood information into the training algorithm, whereas Nan et al.'s method is a prediction-time algorithm. Ma et al. [10] also address this problem of dynamic discovery of features, based on generative modelling and Bayesian experimental design.
Feature elicitation using Reinforcement learning: There is another line of work in the sequential decision-making literature [4, 9, 22] that models test-time elicitation of features by learning an optimal policy for test feature acquisition. Along this direction, our work aligns with that of Shim et al. [25], who jointly train a classifier and an RL agent. Their classifier objective is similar to ours, with a cost penalty; however, they use a deep RL agent to learn the policy. We, on the other hand, use a localised feature selector to find the important feature subsets for the underlying training clusters in the observed feature space.

Active Feature Acquisition: Our problem set-up is also inspired by work on active feature acquisition [13, 14, 19, 23, 29], where certain feature subsets are observed and the rest are acquired at a cost. While all the above-mentioned work follows this problem set-up during training and typically uses active learning to seek informative instances at every iteration, we use this setting for test instances. Unlike their work, all the training instances in our work are fully observed, the assumption being that the feature acquisition cost has already been paid during training. Also, we address a supervised classification problem instead of an active learning set-up. Our problem set-up is similar to that of Kanani et al. [6], who also have partial test instances; however, their problem is instance acquisition, where the acquired feature subset is fixed. Our method aims at discovering variable-length feature subsets for the various underlying clusters.

Our contributions: Although the problem of prediction-time feature elicitation has been explored in the literature from various directions and under various assumptions, we propose an intuitive solution to this problem and formulate it in a two-step optimization framework. We incorporate acquisition cost in both the feature selector and model objectives to balance the trade-off between performance and cost. The problem set-up is naturally applicable in real-world healthcare and other domains where knowledge of the observed features also needs to be accounted for while selecting the elicitable features.

3 COST AWARE FEATURE ELICITATION

3.1 Problem setup
Given: A dataset {(x_1, y_1), ..., (x_n, y_n)} with each x_i \in R^d as the feature vector. Each feature has an associated cost r_i.
Objective: Learn a discriminative model that is aware of the feature costs and can balance the trade-off between feature acquisition cost and model performance.

We make the additional assumption that there is a subset of features with zero cost. These could be, for example, demographic information (e.g., age, gender) in a medical domain, which is easily available or less cumbersome to obtain than other features. In other words, we can partition the feature set F = O ∪ E, where O are the zero-cost observed features and E are the elicitable features, which can be acquired at a cost. We also assume that the training data is completely available with all features (i.e., the cost for all the features has already been paid). The goal is to use these observed features to find similar instances in the training set and identify the important feature subsets for each of these clusters based on a feature selector objective function that balances the trade-off between choosing the important features and the cost at which these features are acquired.

3.2 Proposed solution
As a first step, we cluster the training instances based on just the observed zero-cost feature set O. The intuition is that instances with similar observed features will also have similar characteristics in terms of which elicitable features to order. For example, in a medical application, whether to request a blood test or a CT scan will depend on factors such as age, gender and ethnicity, and on whether patients with similar demographic features had requested these tests. Also, since the feature set O comes at zero cost, we assume that this feature set is observed for unseen test instances.

[Figure 1: Optimization framework for the proposed problem]

We propose a model consisting of a parameterized feature selector module F(X, E_{c_i}, α), which takes a set of input instances E_{c_i} belonging to a cluster c_i formed on the feature set O and produces a subset X of the most important features for the classification task. The feature selection model is based on an information-theoretic objective function and is augmented with the feature cost to account for the trade-off between model performance and acquisition cost at test time. The output feature subset from the feature selector module is used to update the parameters of the classifier. The optimization framework is shown in Figure 1.
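To make the first step concrete, the following is a minimal sketch of the clustering step, assuming scikit-learn and a column-index list obs_cols for the observed feature set O. This is an illustration with hypothetical names, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def cluster_on_observed(X, obs_cols, n_clusters=7, seed=0):
    """Cluster training instances on just the zero-cost observed
    features O, as described in Section 3.2 (illustrative sketch)."""
    X_obs = X[:, obs_cols]                                  # restrict to O
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(X_obs)
    return km, km.labels_                                   # labels_[i] = cluster of x_i

The fitted K-means object can later route a test instance, for which only O is observed, to its cluster.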
Information theoretic Feature selector model: The feature selector module selects the best subset of features for each cluster of training data based on an information-theoretic objective score. At test time we do not know the elicitable feature subset E (finding the truly necessary features for learning is precisely the goal of feature selection), hence we propose to use the set of training instances closest to the current instance. Since we assume that the training data has already been elicited, all features are observed in the training data, and we compute this distance based on just the observed feature set O. We cluster the training data based on the observed features into m clusters c_1, c_2, ..., c_m. Next, we use the Minimum-Redundancy-Maximum-Relevance (MRMR) feature selection paradigm [1, 21]. We denote [\alpha^1_{c_i}, \alpha^2_{c_i}, \alpha^3_{c_i}, \alpha^4_{c_i}] as the parameters of a particular cluster c_i. The feature selection module is a function of the parameters of the cluster to which a set of instances belongs and is defined as:

  F(X, E_{c_i}, \alpha_{c_i}) =
      \alpha^1_{c_i} \sum_{E_p \in X} I(E_p; Y)                                      (max. relevance)
    - \sum_{E_p \in X} \Big( \alpha^2_{c_i} \sum_{E_j \in X} I(E_j; E_p)
                           - \alpha^3_{c_i} \sum_{E_j \in X} I(E_p; E_j \mid Y) \Big)   (min. redundancy)
    - \alpha^4_{c_i} \sum_{E_p \in X} c(E_p)                                         (cost penalty)      (1)

where I(E_p; Y) is the mutual information between the random variable E_p (a feature) and Y (the target). The feature subset X is grown greedily so as to maximize the above objective. In Equation 1, E_p denotes a single feature from the elicitable set E that is evaluated against the subset X grown so far. The first term is the mutual information between each feature and the class variable Y; in a discriminative task, this value should be maximized. The second term is the pairwise mutual information between the candidate feature and the features already added to the subset X; this value should be minimized to select informative, non-redundant features. The third term is called the conditional redundancy [1] and should be maximized. The last term adds a penalty for the cost of every feature and ensures the right trade-off between cost, relevance and redundancy. In this work we do not learn the parameters \alpha_{c_i} for each cluster; instead we fix them to 1, leaving the learning of these parameters to future work.

In our problem setup, since the zero-cost feature subset is always present, we always include the observed feature subset O in addition to the most important feature subset returned by the feature selector objective. We also account for the knowledge of the observed features while growing the informative feature subset through greedy optimization. Specifically, while calculating the pairwise mutual information between the features and the conditional redundancy term (the second and third terms of Equation 1), we also evaluate the mutual information of the candidate features with these observed features. Note that in cases where the observed features are not discriminative enough of the target, the feature selector module ensures that the elicitable features with maximum relevance to the target variable are picked.

Optimization Problem: The cost-aware feature selector F(X, E_{c_i}, \alpha), for a given set of instances E_{c_i} belonging to a specific cluster c_i, solves the following optimization problem:

  X^i_\alpha = \operatorname{argmax}_{X \subseteq E} F(X, E_{c_i}, \alpha)    (2)

For a given instance (x, y), we denote by L(x, y, X, \theta) the loss function using the subset X of features obtained from the feature selector optimization problem. The optimization problem for learning the parameters of the classifier can be posed as:

  \min_\theta \sum_{i=1}^{n} L(x_i, y_i, X^i_\alpha, \theta) + \lambda_1 c(X^i_\alpha) + \lambda_2 \|\theta\|^2    (3)

where \lambda_1 and \lambda_2 are hyper-parameters. In the above equation, \theta is the parameter vector of the model and can be updated by standard gradient-based techniques. This loss function takes into account the important feature subset for each cluster and updates the parameters accordingly. The classifier objective also contains a cost term, denoted by c(X^i_\alpha), to account for the cost of the selected feature subset. For a hard budget on the number of elicited features, this cost component in the model objective can be considered. In the case of a cost budget, this component can be ignored, because the elicited feature subset already adheres to a fixed cost and the term is therefore constant.
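As a sketch under stated assumptions (features already discretized, scikit-learn's mutual_info_score as the estimator of I(·;·), and \alpha fixed to 1 as in the paper), the greedy maximization of Equations 1 and 2 for one cluster could look like the following. It is shown in its cost-budget form; a hard budget on the number of features would replace the cost check with a count check. All function and argument names are hypothetical.

import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mutual_info(a, b, y):
    """I(a; b | Y): class-conditional mutual information, averaged over classes."""
    return sum((y == c).mean() * mutual_info_score(a[y == c], b[y == c])
               for c in np.unique(y))

def greedy_cafe_select(X_disc, y, elicit_cols, obs_cols, cost, budget,
                       alpha=(1.0, 1.0, 1.0, 1.0)):
    """Greedily grow the subset X of Eq. (2) for one cluster by maximizing the
    Eq. (1) score; stops when the budget is exhausted or the marginal score
    turns negative, mirroring the stopping rule in Section 3.3."""
    a1, a2, a3, a4 = alpha
    selected, spent = [], 0.0
    context = list(obs_cols)      # observed features also enter the redundancy terms

    def score(f):
        relevance = a1 * mutual_info_score(X_disc[:, f], y)
        redundancy = sum(a2 * mutual_info_score(X_disc[:, j], X_disc[:, f])
                         - a3 * cond_mutual_info(X_disc[:, f], X_disc[:, j], y)
                         for j in selected + context)
        return relevance - redundancy - a4 * cost[f]

    remaining = list(elicit_cols)
    while remaining:
        best = max(remaining, key=score)
        if score(best) < 0 or spent + cost[best] > budget:
            break
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected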
3.3 Algorithm
We present the algorithm for Cost Aware Feature Elicitation (CAFE) in Algorithm 1. CAFE takes as input a set of training examples E, the zero-cost feature set O, the elicitable feature set E, a cost vector M \in R^d and a budget B. Each element of the training set E is a tuple (x, y), where x \in R^d is the feature vector and y is the label.

The training instances E are clustered based on just the observed feature set O using K-means clustering (Cluster). For every cluster c_i, the training instances belonging to the cluster are assigned to the set E_{c_i} and passed to the feature selector module (lines 6-8). The FeatureSelector function takes E_{c_i}, the parameters \alpha, the feature subsets O and E, the cost vector M and a predefined budget B as input, and returns the most important feature subset X^{c_i}_\alpha for cluster c_i. A greedy optimization technique is used to grow the feature subset X of every cluster based on the feature selector objective function defined in Equation 1. FeatureSelector terminates once the budget B is exhausted or the mutual information score becomes negative. Once the important feature subsets are obtained for all |C| clusters, the model objective function in Equation 3 is optimized over all training instances, using the important feature subset of the cluster to which each training instance belongs (lines 12-18). All remaining features are imputed, either with 0 or with any other imputation model, before the model is trained. The final training model G(E_{O∪X_\alpha}, \alpha, \theta) is a unified model used to make predictions for test instances consisting of just the observed feature subset O.

Algorithm 1 Cost Aware Feature Elicitation
1: function CAFE(E, O, E, M, B)
2:   E = E_{O∪E}  ⊲ E consists of 0-cost features O and costly features E
3:   C = Cluster(E_O)  ⊲ Clustering based on the observed features O
4:   X = {∅}  ⊲ Stores best feature subsets of each cluster
5:   for i = 1 to |C| do  ⊲ Repeat for every cluster
6:     E_{c_i} = GetClusterMember(E, C, i)
7:       ⊲ Get the data points belonging to cluster c_i
8:     X^{c_i}_\alpha = FeatureSelector(E_{c_i}, \alpha, O, E, M, B)
9:       ⊲ Parameterized feature selector for each cluster
10:    X = X ∪ {X^{c_i}_\alpha ∪ O}
11:  end for
12:  for i = 1 to |C| do  ⊲ Repeat for every cluster
13:    X^{c_i}_\alpha = GetFeatureSubset(X, i)
14:      ⊲ Get the feature subset for cluster c_i
15:    for j = 1 to |E_{c_i}| do  ⊲ Repeat for every data point in cluster c_i
16:      Optimize J(x_j, y_j, X^{c_i}_\alpha, \theta, M)
17:        ⊲ Optimize the objective function in Equation 3
18:      Update \theta  ⊲ Update the model parameter \theta
19:    end for
20:  end for
   return G(E_{O∪X_\alpha}, \alpha, \theta)  ⊲ G is the training model built on E
21: end function
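The pieces above can be wired together roughly as follows. This is a hedged end-to-end sketch of Algorithm 1, not the authors' code: it reuses the hypothetical cluster_on_observed and greedy_cafe_select helpers, uses naive rounding as the discretization, and lets an RBF-kernel SVM (the paper's classifier choice in Section 4) stand in for the unified model G.

import numpy as np
from sklearn.svm import SVC

def cafe_train(X, y, obs_cols, elicit_cols, cost, budget, n_clusters=7):
    """End-to-end sketch of Algorithm 1 (CAFE): cluster on observed features,
    select a per-cluster feature subset, train one unified classifier."""
    km, labels = cluster_on_observed(X, obs_cols, n_clusters)
    X_disc = X.round().astype(int)        # crude discretization for the MI scores
    subsets = {}
    for c in range(n_clusters):
        idx = labels == c
        subsets[c] = list(obs_cols) + greedy_cafe_select(
            X_disc[idx], y[idx], elicit_cols, obs_cols, cost, budget)
    # Zero out unselected features per cluster (the plain CAFE variant;
    # CAFE-I would impute them instead) and fit one unified model G.
    X_train = np.zeros_like(X, dtype=float)
    for c, cols in subsets.items():
        rows = np.where(labels == c)[0]
        X_train[np.ix_(rows, cols)] = X[np.ix_(rows, cols)]
    model = SVC(kernel="rbf").fit(X_train, y)
    return km, subsets, model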
4 EMPIRICAL EVALUATION
We experiment with 3 real-world medical data sets. The intuition behind CAFE is most natural in medical domains, hence our choice of data sets; however, the idea can be applied to other domains ranging from logistics to resource allocation. Table 2 summarizes the data sets used in our experiments. Below are the details of the 3 real data sets.

1. Parkinson's disease prediction: The Parkinson's Progression Marker Initiative (PPMI) [12] is an observational study whose aim is to identify Parkinson's disease progression from various types of features. The PPMI data set consists of features related to various motor functions as well as non-motor behavioral and psychological tests. We consider certain motor assessment features, such as rising from chair, gait, freezing of gait, posture and postural stability, as observed features, and all remaining features as elicitable features that must be acquired at a cost.

2. Alzheimer's disease prediction: The Alzheimer's Disease Neuroimaging Initiative (ADNI, www.loni.ucla.edu/ADNI) is a study that aims to test whether various clinical, FMRI and biomarker features can be used to predict the early onset of Alzheimer's disease. In this data set, we consider the patients' demographics as observed, zero-cost features, and the FMRI image data and cognitive score data as unobserved, elicitable features.

3. Rare disease prediction: This data set was created from survey questionnaires [11]; the task is to predict whether a person has a rare disease. The demographic features are observed, while other sensitive survey questions regarding technology use, health and disease-related meta-information are considered elicitable.

Evaluation Methodology: All data sets were partitioned into an 80:20 train-test split. Hyper-parameters such as the number of clusters on the observed features were picked by 5-fold cross validation on each data set. The optimal numbers of clusters were 6 for ADNI, 9 for the Rare disease data set and 7 for PPMI. For the results reported in Table 1, we considered a hard budget on the number of elicitable features and set it to half the total number of features in the respective data set. We use K-means as the underlying clustering algorithm. For all reported results, we use an underlying Support Vector Machine [3] classifier with a radial basis kernel. Since all the data sets are highly imbalanced, we report metrics like recall, F1, AUC-ROC and AUC-PR. For the feature selector module, we used the existing implementation of Li et al. [7] and built upon it.
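As a rough illustration of this protocol (80:20 split, observed-features-only routing of test instances, imbalance-aware metrics), a sketch assuming the hypothetical cafe_train helper from Section 3:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (recall_score, f1_score,
                             roc_auc_score, average_precision_score)

def evaluate(X, y, obs_cols, elicit_cols, cost, budget):
    """80:20 evaluation sketch: train CAFE, route each test instance to its
    cluster via the observed features O, elicit that cluster's subset."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    km, subsets, model = cafe_train(X_tr, y_tr, obs_cols, elicit_cols, cost, budget)
    clusters = km.predict(X_te[:, obs_cols])     # only O is known at test time
    X_masked = np.zeros_like(X_te, dtype=float)  # unelicited features stay 0
    for i, c in enumerate(clusters):
        X_masked[i, subsets[c]] = X_te[i, subsets[c]]
    scores = model.decision_function(X_masked)
    preds = model.predict(X_masked)
    return {"recall": recall_score(y_te, preds),
            "f1": f1_score(y_te, preds),
            "auc_roc": roc_auc_score(y_te, scores),
            "auc_pr": average_precision_score(y_te, scores)}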
We consider two variants of CAFE: (1) CAFE, in which we replace the missing and unimportant features of every cluster with 0 and then update the classifier parameters; and (2) CAFE-I, in which we replace the missing and unimportant features using an imputation model learnt from the already-acquired feature values of other data points. A simple imputation model is used: missing categorical features are replaced with the mode and numeric features with the mean.
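The mode/mean imputation used by CAFE-I could be sketched as follows; the column-type bookkeeping (which columns are categorical) is an assumption of this illustration.

import numpy as np

def impute_unselected(X, rows, unselected_cols, categorical_cols):
    """CAFE-I-style imputation: fill the unselected features of the given
    rows with the mode (categorical) or mean (numeric) of acquired values."""
    X = X.copy()
    for col in unselected_cols:
        column = X[:, col]
        if col in categorical_cols:
            values, counts = np.unique(column, return_counts=True)
            fill = values[counts.argmax()]          # mode
        else:
            fill = column.mean()                    # mean
        X[rows, col] = fill
    return X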
Baselines: We consider 3 baselines for evaluating CAFE and CAFE-I: (1) OBS, which uses only the observed, zero-cost features to train the model; (2) RANDOM, which uses a random subset of a fixed number of elicitable features together with all the observed features; results for this baseline are averaged over 10 runs; (3) KBEST, which uses the information-theoretic feature selector score defined in Equation 1 to select the k best elicitable features on the entire data, without any cluster consideration, together with the observed features. We keep the value of k the same as that used by CAFE. Although some of the existing methods could be potential baselines, none of them matches the exact setting of our problem, hence we do not compare our method against them.

Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features compare against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I fare against the cost-sensitive baseline KBEST?

The results reported in Table 1 suggest that both CAFE and CAFE-I significantly outperform the other baselines on almost all metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and CAFE-I outperform the other baselines on the clinically relevant recall metric, while KBEST performs best on the other metrics. The reason is that the elicitable features in ADNI are image features, which we discretize to calculate the information gain for the feature selector module; the granular feature information is lost in this discretization, hence the drop in performance. On average, CAFE-I performs better than CAFE across all data sets because its underlying imputation model treats the missing values better than replacing all of them with 0. This answers Q1 affirmatively.

Data set      Algorithm  Recall          F1              AUC-ROC         AUC-PR
Rare disease  OBS        0.647           0.488           0.642           0.347
              RANDOM     0.57 ± 0.064    0.549 ± 0.059   0.693 ± 0.042   0.421 ± 0.051
              KBEST      0.47            0.457           0.628           0.349
              CAFE       0.647           0.628           0.749           0.489
              CAFE-I     0.647           0.647           0.759           0.512
PPMI          OBS        0.765           0.685           0.741           0.563
              RANDOM     0.857 ± 0.023   0.809 ± 0.015   0.85 ± 0.013    0.712 ± 0.020
              KBEST      0.828           0.807           0.846           0.716
              CAFE       0.846           0.817           0.855           0.726
              CAFE-I     0.855           0.829           0.865           0.743
ADNI          OBS        0.5             0.44            0.553           0.365
              RANDOM     0.711 ± 0.043   0.697 ± 0.082   0.767 ± 0.064   0.592 ± 0.098
              KBEST      0.73            0.745           0.806           0.646
              CAFE       0.807           0.711           0.786           0.578
              CAFE-I     0.769           0.701           0.776           0.574
Table 1: Comparison of CAFE against other baseline methods on 3 real data sets

Dataset       #Pos  #Neg  #Observed  #Elicitable
PPMI          554   919   5          31
ADNI          94    287   6          69
Rare Disease  87    232   6          63
Table 2: Data set details of the 3 real data sets used. #Pos is the number of positive examples, #Neg the number of negative examples, #Observed the number of observed features and #Elicitable the maximum number of features that can be acquired.

In Figure 3, we compare the cost versions of CAFE and CAFE-I against KBEST. The cost version takes into account the cost of individual features and accounts for it as a penalty in the feature selector module. Hence, in this version of CAFE, a cost budget is used as opposed to a hard budget on the number of elicitable features. We generate the cost vector by sampling each cost component uniformly from (0, 1). For PPMI and Rare disease, cost-sensitive CAFE performs consistently better than KBEST with increasing cost budget. On the PPMI data set, the greedy optimization of the feature selector objective on the entire data set (as done by KBEST) led to the elicitation of just 1 feature, beyond which the information gain was negative; hence KBEST's performance on PPMI remains the same across the various cost budgets. CAFE, on the other hand, was able to select important feature subsets for the various clusters based on the observed features related to gait and posture. For the ADNI data set, CAFE performs better than KBEST only in recall, for the same reason mentioned above. This answers Q2 affirmatively.

[Figure 3: Recall (left), F1 (middle), AUC-PR (right) for (from top to bottom) Rare disease, PPMI and ADNI. The x-axis refers to the cost budget used, which leads to the elicitation of different numbers of features.]
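For these cost-budget experiments, the cost vector might be generated as in this small sketch; zeroing the observed features' costs is an assumption consistent with the problem setup, and the names reuse the hypothetical helpers above.

import numpy as np

def sample_costs(n_features, obs_cols, seed=0):
    """Sample each feature's acquisition cost uniformly from (0, 1),
    as in Section 4; the observed features O remain free."""
    rng = np.random.default_rng(seed)
    cost = rng.uniform(0.0, 1.0, size=n_features)
    cost[list(obs_cols)] = 0.0
    return cost

Sweeping increasing budget values and calling greedy_cafe_select with each would then elicit progressively larger feature subsets, which is the x-axis of Figure 3.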
Lastly, Figure 2 shows the effect of an increasing number of clusters on the validation recall for the Rare disease data set. As can be seen, for a small number of clusters the recall is very low, and it increases to an optimum at 9 clusters. This shows that forming clusters based on the observed features helps CAFE select different feature subsets for different clusters, thus aiding the learning procedure.

[Figure 2: Recall vs. number of clusters on the Rare disease data set for CAFE-I]

5 CONCLUSION
In this paper, we pose the prediction-time feature elicitation problem as an optimization problem, employing a cluster-specific feature selector to choose the best feature subset and then optimizing the training loss. We show the effectiveness of our approach on real data sets where the problem set-up is intuitive. Future work includes learning the parameters of the feature selector module, jointly optimizing the feature selector and model parameters for a more robust framework, and adding more constraints to the optimization.

ACKNOWLEDGEMENTS
SN & SD gratefully acknowledge the support of NSF grant IIS-1836565. Any opinions, findings and conclusions or recommendations are those of the authors and do not necessarily reflect the view of the US government.

REFERENCES
[1] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR (2012).
[2] Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X Ling. 2004. Test-cost sensitive naive bayes classification. In ICDM.
[3] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[4] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datum-wise classification: a sequential approach to sparsity. In ECML PKDD. 375–390.
[5] Tianshi Gao and Daphne Koller. 2011. Active classification based on value of classifier. In NIPS.
[6] P. Kanani and P. Melville. 2008. Prediction-time active feature-value acquisition for cost-effective customer targeting. Workshop on Cost Sensitive Learning at NIPS (2008).
[7] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. 2018. Feature selection: A data perspective. ACM Computing Surveys (CSUR) (2018).
[8] Charles X Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. 2004. Decision trees with minimal costs. In ICML.
[9] D. J. Lizotte, O. Madani, and R. Greiner. 2003. Budgeted learning of Naive-Bayes classifiers. In UAI. 378–385.
[10] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. 2019. EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE. In ICML.
[11] H. MacLeod, S. Yang, et al. 2016. Identifying rare diseases from behavioural data: a machine learning approach. In CHASE. 130–139.
[12] K. Marek, D. Jennings, et al. 2011. The Parkinson Progression Marker Initiative (PPMI). Prog Neurobiol 95, 4 (2011), 629–635.
[13] P. Melville, M. Saar-Tsechansky, et al. 2004. Active feature-value acquisition for classifier induction. In ICDM. 483–486.
[14] P. Melville, M. Saar-Tsechansky, et al. 2005. An expected utility approach to active feature-value acquisition. In ICDM. 745–748.
[15] Feng Nan and Venkatesh Saligrama. 2017. Adaptive classification for prediction under a budget. In NIPS.
[16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2015. Feature-budgeted random forest. In ICML.
[17] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2016. Pruning random forests for prediction on a budget. In NIPS.
[18] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2014. Fast margin-based cost-sensitive classification. In ICASSP.
[19] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag Radivojac. 2018. On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach. In IJCAI.
[20] S. Natarajan, A. Prabhakar, et al. 2017. Boosting for postpartum depression prediction. In CHASE. 232–240.
[21] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226–1238.
[22] Thomas Rückstieß, Christian Osendorfer, and Patrick van der Smagt. 2011. Sequential feature selection for classification. In Australasian Joint Conference on Artificial Intelligence. Springer, 132–141.
[23] M. Saar-Tsechansky, P. Melville, and F. Provost. 2009. Active feature-value acquisition. Manag Sci 55, 4 (2009).
[24] Victor S Sheng and Charles X Ling. 2006. Feature value acquisition in testing: a sequential batch test algorithm. In ICML.
[25] Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint active feature acquisition and classification with variable-size set encoding. In NIPS.
[26] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learning by directed acyclic graph for resource constrained prediction. In NIPS.
[27] Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. 2013. Cost-sensitive tree of classifiers. In ICML.
[28] Zhixiang Xu, Kilian Q Weinberger, and Olivier Chapelle. 2012. The greedy miser: learning under test-time budgets. In ICML.
[29] Z. Zheng and B. Padmanabhan. 2002. On active learning for data acquisition. In ICDM. 562–569.