Cost Aware Feature Elicitation

Srijita Das, The University of Texas at Dallas, Srijita.Das@utdallas.edu
Rishabh Iyer, The University of Texas at Dallas, Rishabh.Iyer@utdallas.edu
Sriraam Natarajan, The University of Texas at Dallas, Sriraam.Natarajan@utdallas.edu

ABSTRACT
Motivated by clinical tasks where acquiring certain features, such as FMRI scans or blood tests, can be expensive, we address the problem of test-time elicitation of features. We formulate cost-aware feature elicitation as an optimization problem that trades off performance against feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the effectiveness of the proposed approach in minimizing costs while maximizing performance.

CCS CONCEPTS
• Supervised learning → Budgeted learning; Feature selection; • Applications → Healthcare.

KEYWORDS
cost sensitive learning, supervised learning, classification

ACM Reference Format:
Srijita Das, Rishabh Iyer, and Sriraam Natarajan. 2020. Cost Aware Feature Elicitation. In Proceedings of KDD Workshop on Knowledge-infused Mining and Learning (KiML'20). 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

In M. Gaur, A. Jaimes, F. Ozcan, S. Shah, A. Sheth, B. Srivastava, Proceedings of the Workshop on Knowledge-infused Mining and Learning (KDD-KiML 2020), San Diego, California, USA, August 24, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). © 2020 Copyright held by the author(s).

1 INTRODUCTION
In the supervised classification setting, every instance has a fixed-length feature vector, and a discriminative function is learnt on this feature vector and its corresponding class variable. However, many practical problems, such as healthcare, network domains and survey questionnaire design [19, 20], have an associated feature acquisition cost. In such domains there is a cost budget, and acquiring all the features of an instance can be very costly. As a result, many cost-sensitive classifier models [2, 8, 24] have been proposed in the literature to incorporate the cost of acquisition into the model objective during training and prediction.

Our problem is motivated by such a cost-aware setting, where the assumption is that prediction-time features have an acquisition cost and must adhere to a strict budget. Consider a patient visiting a doctor for a potential diagnosis of a disease. For such a patient, information like age, gender, ethnicity and other demographic features is easily available at zero cost. However, the various lab tests that the patient needs to undergo incur costs. A trained model should therefore be able to identify the most relevant (i.e., most informative, yet least costly) lab tests required for each specific patient. The intuition of this work is that different patients, depending on their history, ethnicity, age and gender, may require different tests for reasonably accurate prediction. We build on the intuition that, given certain observed features such as one's demographic details, the most important features for a patient depend on the important features of similar patients. Based on this intuition, we find similar data points in the observed feature space and identify the important feature subsets of these similar instances by employing a greedy information-theoretic feature selector objective.

Our contributions in this work are as follows: (1) we formalize the problem as a joint optimization problem of selecting the best feature subset for similar data points and optimizing the loss function using the important feature subsets; (2) we account for acquisition cost in both the feature selector objective and the classifier objective to balance the trade-off between acquisition cost and model performance; (3) we empirically demonstrate the effectiveness of the proposed approach on three real-world medical data sets.
2 RELATED WORK
The related work on cost-sensitive feature selection and learning can be categorized into the following four broad approaches.

Tree based budgeted learning: Prediction-time elicitation of features under a cost budget has been widely studied in the literature. Much work has been done on tree-based models [5, 16, 17, 26–28] by adding a cost term to the tree objective function, in either decision trees or ensemble methods like gradient boosted trees. All these methods aim to build an adaptive and complex decision boundary by considering the trade-off between performance and test-time feature acquisition cost. While we are similar in motivation to these approaches, our methodology is different in that we do not use tree-based models. Instead, our approach finds local feature subsets using an information-theoretic feature selector for different clusters of training instances built in a lower-dimensional space.

Adaptive classification and dynamic feature discovery: Our work also draws inspiration from Nan et al.'s work [15], where they learn a high-performance costly model and approximate its performance adaptively by building a low-cost model and a gating function that decides which model to use for specific instances. This adaptive switching between the low-cost and high-cost models handles the trade-off between cost and performance. Our method differs from theirs because we do not maintain a high-cost model, which is costly to build and difficult to decide on; instead, we refine the parameters of a single low-cost model by incorporating a cost penalty in the feature selector and model objectives. Our work is also along the direction of Nan et al.'s work [18], where they select varying feature subsets for test instances using neighbourhood information from the training data. While calculating neighborhood information from training data is similar to building clusters in our approach, the training neighborhood in our method is computed on just the observed feature space. Moreover, we incorporate the neighbourhood information into the training algorithm, whereas Nan et al.'s method is a prediction-time algorithm. Ma et al. [10] also address this problem of dynamic discovery of features, based on generative modelling and Bayesian experimental design.
Feature elicitation using Reinforcement learning: There is another line of work in the sequential decision-making literature [4, 9, 22] that models test-time elicitation of features by learning an optimal policy for test feature acquisition. Along this direction, our work aligns with that of Shim et al. [25], who jointly train a classifier and an RL agent. Their classifier objective is similar to ours, with a cost penalty; however, they use a deep RL agent to learn the policy. We, on the other hand, use a localised feature selector to find the important feature subsets for the underlying training clusters in the observed feature space.

Active Feature Acquisition: Our problem set-up is also inspired by work on active feature acquisition [13, 14, 19, 23, 29], where certain feature subsets are observed and the rest are acquired at a cost. While all the above-mentioned work follows this problem set-up during training and typically uses active learning to seek informative instances at every iteration, we use this setting for test instances. Unlike their work, all the training instances in our work are fully observed, the assumption being that the feature acquisition cost has already been paid during training. Also, we address a supervised classification problem instead of an active learning set-up. Our problem set-up is similar to that of Kanani et al. [6], who also have partial test instances; however, their problem is instance acquisition, where the acquired feature subset is fixed. Our method aims at discovering variable-length feature subsets for the various underlying clusters.

Our contributions: Although the problem of prediction-time feature elicitation has been explored in the literature from various directions and under various assumptions, we propose an intuitive solution to this problem and formulate it in a two-step optimization framework. We incorporate acquisition cost in both the feature selector and model objectives to balance the trade-off between performance and cost. The problem set-up is naturally applicable in real-world healthcare and other domains where knowledge of the observed features also needs to be accounted for while selecting the elicitable features.

3 COST AWARE FEATURE ELICITATION

3.1 Problem setup
Given: A dataset {(x_1, y_1), ..., (x_n, y_n)} with each x_i \in R^d as the feature vector. Each feature has an associated cost r_i.
Objective: Learn a discriminative model that is aware of the feature costs and can balance the trade-off between feature acquisition cost and model performance.

We make the additional assumption that there is a subset of features with zero cost. These could be, for example, demographic information (e.g., age, gender) in a medical domain, which is easily available or less cumbersome to obtain than other features. In other words, we can partition the feature set F = O ∪ E, where O are the zero-cost observed features and E are the elicitable features, which can be acquired at a cost. We also assume that the training data is completely available with all features (i.e., the cost for all the features has already been paid). The goal is to use these observed features to find similar instances in the training set and identify the important feature subsets for each of these clusters based on a feature selector objective function that balances the trade-off between choosing the important features and the cost at which these features are acquired.

3.2 Proposed solution
As a first step, we cluster the training instances based on just the observed zero-cost feature set O. The intuition is that instances with similar observed features will also have similar characteristics in terms of which elicitable features to order. For example, in a medical application, whether to request a blood test or a CT scan will depend on factors such as age, gender and ethnicity, and on whether patients with similar demographic features had requested these tests. Also, since the feature set O comes at zero cost, we assume that this feature set is observed for unseen test instances.

[Figure 1: Optimization framework for the proposed problem]

We propose a model consisting of a parameterized feature selector module F(X, E_{c_i}, α), which takes a set of input instances E_{c_i} belonging to a cluster c_i formed on the feature set O and produces a subset X of the most important features for the classification task. The feature selection model is based on an information-theoretic objective function and is augmented with the feature cost to account for the trade-off between model performance and acquisition cost at test time. The output feature subset from the feature selector module is used to update the parameters of the classifier. The optimization framework is shown in Figure 1.
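To make the first step concrete, the following is a minimal sketch of the clustering step, assuming scikit-learn and a column-index list obs_cols for the observed feature set O. This is an illustration with hypothetical names, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def cluster_on_observed(X, obs_cols, n_clusters=7, seed=0):
    """Cluster training instances on just the zero-cost observed
    features O, as described in Section 3.2 (illustrative sketch)."""
    X_obs = X[:, obs_cols]                                  # restrict to O
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(X_obs)
    return km, km.labels_                                   # labels_[i] = cluster of x_i

The fitted K-means object can later route a test instance, for which only O is observed, to its cluster.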
Information theoretic Feature selector model: The feature selector module selects the best subset of features for each cluster of training data based on an information-theoretic objective score. At test time we do not know the elicitable feature subset E (finding the truly necessary features for learning is precisely the goal of feature selection), hence we propose to use the set of training instances closest to the current instance. Since we assume that the training data has already been elicited, all features are observed in the training data, and we compute this distance based on just the observed feature set O. We cluster the training data based on the observed features into m clusters c_1, c_2, ..., c_m. Next, we use the Minimum-Redundancy-Maximum-Relevance (MRMR) feature selection paradigm [1, 21]. We denote [\alpha^1_{c_i}, \alpha^2_{c_i}, \alpha^3_{c_i}, \alpha^4_{c_i}] as the parameters of a particular cluster c_i. The feature selection module is a function of the parameters of the cluster to which a set of instances belongs and is defined as:

  F(X, E_{c_i}, \alpha_{c_i}) =
      \alpha^1_{c_i} \sum_{E_p \in X} I(E_p; Y)                                      (max. relevance)
    - \sum_{E_p \in X} \Big( \alpha^2_{c_i} \sum_{E_j \in X} I(E_j; E_p)
                           - \alpha^3_{c_i} \sum_{E_j \in X} I(E_p; E_j \mid Y) \Big)   (min. redundancy)
    - \alpha^4_{c_i} \sum_{E_p \in X} c(E_p)                                         (cost penalty)      (1)

where I(E_p; Y) is the mutual information between the random variable E_p (a feature) and Y (the target). The feature subset X is grown greedily so as to maximize the above objective. In Equation 1, E_p denotes a single feature from the elicitable set E that is evaluated against the subset X grown so far. The first term is the mutual information between each feature and the class variable Y; in a discriminative task, this value should be maximized. The second term is the pairwise mutual information between the candidate feature and the features already added to the subset X; this value should be minimized to select informative, non-redundant features. The third term is called the conditional redundancy [1] and should be maximized. The last term adds a penalty for the cost of every feature and ensures the right trade-off between cost, relevance and redundancy. In this work we do not learn the parameters \alpha_{c_i} for each cluster; instead we fix them to 1, leaving the learning of these parameters to future work.

In our problem setup, since the zero-cost feature subset is always present, we always include the observed feature subset O in addition to the most important feature subset returned by the feature selector objective. We also account for the knowledge of the observed features while growing the informative feature subset through greedy optimization. Specifically, while calculating the pairwise mutual information between the features and the conditional redundancy term (the second and third terms of Equation 1), we also evaluate the mutual information of the candidate features with these observed features. Note that in cases where the observed features are not discriminative enough of the target, the feature selector module ensures that the elicitable features with maximum relevance to the target variable are picked.

Optimization Problem: The cost-aware feature selector F(X, E_{c_i}, \alpha), for a given set of instances E_{c_i} belonging to a specific cluster c_i, solves the following optimization problem:

  X^i_\alpha = \operatorname{argmax}_{X \subseteq E} F(X, E_{c_i}, \alpha)    (2)

For a given instance (x, y), we denote by L(x, y, X, \theta) the loss function using the subset X of features obtained from the feature selector optimization problem. The optimization problem for learning the parameters of the classifier can be posed as:

  \min_\theta \sum_{i=1}^{n} L(x_i, y_i, X^i_\alpha, \theta) + \lambda_1 c(X^i_\alpha) + \lambda_2 \|\theta\|^2    (3)

where \lambda_1 and \lambda_2 are hyper-parameters. In the above equation, \theta is the parameter vector of the model and can be updated by standard gradient-based techniques. This loss function takes into account the important feature subset for each cluster and updates the parameters accordingly. The classifier objective also contains a cost term, denoted by c(X^i_\alpha), to account for the cost of the selected feature subset. For a hard budget on the number of elicited features, this cost component in the model objective can be considered. In the case of a cost budget, this component can be ignored, because the elicited feature subset already adheres to a fixed cost and the term is therefore constant.
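As a sketch under stated assumptions (features already discretized, scikit-learn's mutual_info_score as the estimator of I(·;·), and \alpha fixed to 1 as in the paper), the greedy maximization of Equations 1 and 2 for one cluster could look like the following. It is shown in its cost-budget form; a hard budget on the number of features would replace the cost check with a count check. All function and argument names are hypothetical.

import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mutual_info(a, b, y):
    """I(a; b | Y): class-conditional mutual information, averaged over classes."""
    return sum((y == c).mean() * mutual_info_score(a[y == c], b[y == c])
               for c in np.unique(y))

def greedy_cafe_select(X_disc, y, elicit_cols, obs_cols, cost, budget,
                       alpha=(1.0, 1.0, 1.0, 1.0)):
    """Greedily grow the subset X of Eq. (2) for one cluster by maximizing the
    Eq. (1) score; stops when the budget is exhausted or the marginal score
    turns negative, mirroring the stopping rule in Section 3.3."""
    a1, a2, a3, a4 = alpha
    selected, spent = [], 0.0
    context = list(obs_cols)      # observed features also enter the redundancy terms

    def score(f):
        relevance = a1 * mutual_info_score(X_disc[:, f], y)
        redundancy = sum(a2 * mutual_info_score(X_disc[:, j], X_disc[:, f])
                         - a3 * cond_mutual_info(X_disc[:, f], X_disc[:, j], y)
                         for j in selected + context)
        return relevance - redundancy - a4 * cost[f]

    remaining = list(elicit_cols)
    while remaining:
        best = max(remaining, key=score)
        if score(best) < 0 or spent + cost[best] > budget:
            break
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected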
3.3 Algorithm
We present the algorithm for Cost Aware Feature Elicitation (CAFE) in Algorithm 1. CAFE takes as input a set of training examples E, the zero-cost feature set O, the elicitable feature set E, a cost vector M \in R^d and a budget B. Each element of the training set E is a tuple (x, y), where x \in R^d is the feature vector and y is the label.

The training instances E are clustered based on just the observed feature set O using K-means clustering (Cluster). For every cluster c_i, the training instances belonging to the cluster are assigned to the set E_{c_i} and passed to the feature selector module (lines 6-8). The FeatureSelector function takes E_{c_i}, the parameters \alpha, the feature subsets O and E, the cost vector M and a predefined budget B as input, and returns the most important feature subset X^{c_i}_\alpha for cluster c_i. A greedy optimization technique is used to grow the feature subset X of every cluster based on the feature selector objective function defined in Equation 1. FeatureSelector terminates once the budget B is exhausted or the mutual information score becomes negative. Once the important feature subsets are obtained for all |C| clusters, the model objective function in Equation 3 is optimized over all training instances, using the important feature subset of the cluster to which each training instance belongs (lines 12-18). All remaining features are imputed, either with 0 or with any other imputation model, before the model is trained. The final training model G(E_{O∪X_\alpha}, \alpha, \theta) is a unified model used to make predictions for test instances consisting of just the observed feature subset O.

Algorithm 1 Cost Aware Feature Elicitation
1: function CAFE(E, O, E, M, B)
2:   E = E_{O∪E}  ⊲ E consists of 0-cost features O and costly features E
3:   C = Cluster(E_O)  ⊲ Clustering based on the observed features O
4:   X = {∅}  ⊲ Stores best feature subsets of each cluster
5:   for i = 1 to |C| do  ⊲ Repeat for every cluster
6:     E_{c_i} = GetClusterMember(E, C, i)
7:       ⊲ Get the data points belonging to cluster c_i
8:     X^{c_i}_\alpha = FeatureSelector(E_{c_i}, \alpha, O, E, M, B)
9:       ⊲ Parameterized feature selector for each cluster
10:    X = X ∪ {X^{c_i}_\alpha ∪ O}
11:  end for
12:  for i = 1 to |C| do  ⊲ Repeat for every cluster
13:    X^{c_i}_\alpha = GetFeatureSubset(X, i)
14:      ⊲ Get the feature subset for cluster c_i
15:    for j = 1 to |E_{c_i}| do  ⊲ Repeat for every data point in cluster c_i
16:      Optimize J(x_j, y_j, X^{c_i}_\alpha, \theta, M)
17:        ⊲ Optimize the objective function in Equation 3
18:      Update \theta  ⊲ Update the model parameter \theta
19:    end for
20:  end for
   return G(E_{O∪X_\alpha}, \alpha, \theta)  ⊲ G is the training model built on E
21: end function
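The pieces above can be wired together roughly as follows. This is a hedged end-to-end sketch of Algorithm 1, not the authors' code: it reuses the hypothetical cluster_on_observed and greedy_cafe_select helpers, uses naive rounding as the discretization, and lets an RBF-kernel SVM (the paper's classifier choice in Section 4) stand in for the unified model G.

import numpy as np
from sklearn.svm import SVC

def cafe_train(X, y, obs_cols, elicit_cols, cost, budget, n_clusters=7):
    """End-to-end sketch of Algorithm 1 (CAFE): cluster on observed features,
    select a per-cluster feature subset, train one unified classifier."""
    km, labels = cluster_on_observed(X, obs_cols, n_clusters)
    X_disc = X.round().astype(int)        # crude discretization for the MI scores
    subsets = {}
    for c in range(n_clusters):
        idx = labels == c
        subsets[c] = list(obs_cols) + greedy_cafe_select(
            X_disc[idx], y[idx], elicit_cols, obs_cols, cost, budget)
    # Zero out unselected features per cluster (the plain CAFE variant;
    # CAFE-I would impute them instead) and fit one unified model G.
    X_train = np.zeros_like(X, dtype=float)
    for c, cols in subsets.items():
        rows = np.where(labels == c)[0]
        X_train[np.ix_(rows, cols)] = X[np.ix_(rows, cols)]
    model = SVC(kernel="rbf").fit(X_train, y)
    return km, subsets, model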
4 EMPIRICAL EVALUATION
We experiment with 3 real-world medical data sets. The intuition behind CAFE is most natural in medical domains, hence our choice of data sets; however, the idea can be applied to other domains ranging from logistics to resource allocation. Table 2 summarizes the data sets used in our experiments. Below are the details of the 3 real data sets.

1. Parkinson's disease prediction: The Parkinson's Progression Marker Initiative (PPMI) [12] is an observational study whose aim is to identify Parkinson's disease progression from various types of features. The PPMI data set consists of features related to various motor functions as well as non-motor behavioral and psychological tests. We consider certain motor assessment features, such as rising from chair, gait, freezing of gait, posture and postural stability, as observed features, and all remaining features as elicitable features that must be acquired at a cost.

2. Alzheimer's disease prediction: The Alzheimer's Disease Neuroimaging Initiative (ADNI, www.loni.ucla.edu/ADNI) is a study that aims to test whether various clinical, FMRI and biomarker features can be used to predict the early onset of Alzheimer's disease. In this data set, we consider the patients' demographics as observed, zero-cost features, and the FMRI image data and cognitive score data as unobserved, elicitable features.

3. Rare disease prediction: This data set was created from survey questionnaires [11]; the task is to predict whether a person has a rare disease. The demographic features are observed, while other sensitive survey questions regarding technology use, health and disease-related meta-information are considered elicitable.

Evaluation Methodology: All data sets were partitioned into an 80:20 train-test split. Hyper-parameters such as the number of clusters on the observed features were picked by 5-fold cross validation on each data set. The optimal numbers of clusters were 6 for ADNI, 9 for the Rare disease data set and 7 for PPMI. For the results reported in Table 1, we considered a hard budget on the number of elicitable features and set it to half the total number of features in the respective data set. We use K-means as the underlying clustering algorithm. For all reported results, we use an underlying Support Vector Machine [3] classifier with a radial basis kernel. Since all the data sets are highly imbalanced, we report metrics like recall, F1, AUC-ROC and AUC-PR. For the feature selector module, we used the existing implementation of Li et al. [7] and built upon it.
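As a rough illustration of this protocol (80:20 split, observed-features-only routing of test instances, imbalance-aware metrics), a sketch assuming the hypothetical cafe_train helper from Section 3:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (recall_score, f1_score,
                             roc_auc_score, average_precision_score)

def evaluate(X, y, obs_cols, elicit_cols, cost, budget):
    """80:20 evaluation sketch: train CAFE, route each test instance to its
    cluster via the observed features O, elicit that cluster's subset."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    km, subsets, model = cafe_train(X_tr, y_tr, obs_cols, elicit_cols, cost, budget)
    clusters = km.predict(X_te[:, obs_cols])     # only O is known at test time
    X_masked = np.zeros_like(X_te, dtype=float)  # unelicited features stay 0
    for i, c in enumerate(clusters):
        X_masked[i, subsets[c]] = X_te[i, subsets[c]]
    scores = model.decision_function(X_masked)
    preds = model.predict(X_masked)
    return {"recall": recall_score(y_te, preds),
            "f1": f1_score(y_te, preds),
            "auc_roc": roc_auc_score(y_te, scores),
            "auc_pr": average_precision_score(y_te, scores)}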
We consider two variants of CAFE: (1) CAFE, in which we replace the missing and unimportant features of every cluster with 0 and then update the classifier parameters; and (2) CAFE-I, in which we replace the missing and unimportant features using an imputation model learnt from the already-acquired feature values of other data points. A simple imputation model is used: missing categorical features are replaced with the mode and numeric features with the mean.
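The mode/mean imputation used by CAFE-I could be sketched as follows; the column-type bookkeeping (which columns are categorical) is an assumption of this illustration.

import numpy as np

def impute_unselected(X, rows, unselected_cols, categorical_cols):
    """CAFE-I-style imputation: fill the unselected features of the given
    rows with the mode (categorical) or mean (numeric) of acquired values."""
    X = X.copy()
    for col in unselected_cols:
        column = X[:, col]
        if col in categorical_cols:
            values, counts = np.unique(column, return_counts=True)
            fill = values[counts.argmax()]          # mode
        else:
            fill = column.mean()                    # mean
        X[rows, col] = fill
    return X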
Baselines: We consider 3 baselines for evaluating CAFE and CAFE-I: (1) OBS, which uses only the observed, zero-cost features to train the model; (2) RANDOM, which uses a random subset of a fixed number of elicitable features together with all the observed features; results for this baseline are averaged over 10 runs; (3) KBEST, which uses the information-theoretic feature selector score defined in Equation 1 to select the k best elicitable features on the entire data, without any cluster consideration, together with the observed features. We keep the value of k the same as that used by CAFE. Although some of the existing methods could be potential baselines, none of them matches the exact setting of our problem, hence we do not compare our method against them.

Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features compare against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I fare against the cost-sensitive baseline KBEST?

The results reported in Table 1 suggest that both CAFE and CAFE-I significantly outperform the other baselines on almost all metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and CAFE-I outperform the other baselines on the clinically relevant recall metric, while KBEST performs best on the other metrics. The reason is that the elicitable features in ADNI are image features, which we discretize to calculate the information gain for the feature selector module; the granular feature information is lost in this discretization, hence the drop in performance. On average, CAFE-I performs better than CAFE across all data sets because its underlying imputation model treats the missing values better than replacing all of them with 0. This answers Q1 affirmatively.

Data set      Algorithm  Recall          F1              AUC-ROC         AUC-PR
Rare disease  OBS        0.647           0.488           0.642           0.347
              RANDOM     0.57 ± 0.064    0.549 ± 0.059   0.693 ± 0.042   0.421 ± 0.051
              KBEST      0.47            0.457           0.628           0.349
              CAFE       0.647           0.628           0.749           0.489
              CAFE-I     0.647           0.647           0.759           0.512
PPMI          OBS        0.765           0.685           0.741           0.563
              RANDOM     0.857 ± 0.023   0.809 ± 0.015   0.85 ± 0.013    0.712 ± 0.020
              KBEST      0.828           0.807           0.846           0.716
              CAFE       0.846           0.817           0.855           0.726
              CAFE-I     0.855           0.829           0.865           0.743
ADNI          OBS        0.5             0.44            0.553           0.365
              RANDOM     0.711 ± 0.043   0.697 ± 0.082   0.767 ± 0.064   0.592 ± 0.098
              KBEST      0.73            0.745           0.806           0.646
              CAFE       0.807           0.711           0.786           0.578
              CAFE-I     0.769           0.701           0.776           0.574
Table 1: Comparison of CAFE against other baseline methods on 3 real data sets

Dataset       #Pos  #Neg  #Observed  #Elicitable
PPMI          554   919   5          31
ADNI          94    287   6          69
Rare Disease  87    232   6          63
Table 2: Data set details of the 3 real data sets used. #Pos is the number of positive examples, #Neg the number of negative examples, #Observed the number of observed features and #Elicitable the maximum number of features that can be acquired.

In Figure 3, we compare the cost versions of CAFE and CAFE-I against KBEST. The cost version takes into account the cost of individual features and accounts for it as a penalty in the feature selector module. Hence, in this version of CAFE, a cost budget is used as opposed to a hard budget on the number of elicitable features. We generate the cost vector by sampling each cost component uniformly from (0, 1). For PPMI and Rare disease, cost-sensitive CAFE performs consistently better than KBEST with increasing cost budget. On the PPMI data set, the greedy optimization of the feature selector objective on the entire data set (as done by KBEST) led to the elicitation of just 1 feature, beyond which the information gain was negative; hence KBEST's performance on PPMI remains the same across the various cost budgets. CAFE, on the other hand, was able to select important feature subsets for the various clusters based on the observed features related to gait and posture. For the ADNI data set, CAFE performs better than KBEST only in recall, for the same reason mentioned above. This answers Q2 affirmatively.

[Figure 3: Recall (left), F1 (middle), AUC-PR (right) for (from top to bottom) Rare disease, PPMI and ADNI. The x-axis refers to the cost budget used, which leads to the elicitation of different numbers of features.]
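For these cost-budget experiments, the cost vector might be generated as in this small sketch; zeroing the observed features' costs is an assumption consistent with the problem setup, and the names reuse the hypothetical helpers above.

import numpy as np

def sample_costs(n_features, obs_cols, seed=0):
    """Sample each feature's acquisition cost uniformly from (0, 1),
    as in Section 4; the observed features O remain free."""
    rng = np.random.default_rng(seed)
    cost = rng.uniform(0.0, 1.0, size=n_features)
    cost[list(obs_cols)] = 0.0
    return cost

Sweeping increasing budget values and calling greedy_cafe_select with each would then elicit progressively larger feature subsets, which is the x-axis of Figure 3.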
Lastly, Figure 2 shows the effect of an increasing number of clusters on the validation recall for the Rare disease data set. As can be seen, for a small number of clusters the recall is very low, and it increases to an optimum at 9 clusters. This shows that forming clusters based on the observed features helps CAFE select different feature subsets for different clusters, thus aiding the learning procedure.

[Figure 2: Recall vs. number of clusters on the Rare disease data set for CAFE-I]

5 CONCLUSION
In this paper, we pose the prediction-time feature elicitation problem as an optimization problem, employing a cluster-specific feature selector to choose the best feature subset and then optimizing the training loss. We show the effectiveness of our approach on real data sets where the problem set-up is intuitive. Future work includes learning the parameters of the feature selector module, jointly optimizing the feature selector and model parameters for a more robust framework, and adding more constraints to the optimization.

ACKNOWLEDGEMENTS
SN & SD gratefully acknowledge the support of NSF grant IIS-1836565. Any opinions, findings and conclusions or recommendations are those of the authors and do not necessarily reflect the view of the US government.

REFERENCES
[1] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. 2012. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. JMLR (2012).
[2] Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X Ling. 2004. Test-cost sensitive naive bayes classification. In ICDM.
[3] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning (1995).
[4] Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datum-wise classification: a sequential approach to sparsity. In ECML PKDD. 375–390.
[5] Tianshi Gao and Daphne Koller. 2011. Active classification based on value of classifier. In NIPS.
[6] P. Kanani and P. Melville. 2008. Prediction-time active feature-value acquisition for cost-effective customer targeting. Workshop on Cost Sensitive Learning at NIPS (2008).
[7] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. 2018. Feature selection: A data perspective. ACM Computing Surveys (CSUR) (2018).
[8] Charles X Ling, Qiang Yang, Jianning Wang, and Shichao Zhang. 2004. Decision trees with minimal costs. In ICML.
[9] D. J. Lizotte, O. Madani, and R. Greiner. 2003. Budgeted learning of Naive-Bayes classifiers. In UAI. 378–385.
[10] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez-Lobato, Sebastian Nowozin, and Cheng Zhang. 2019. EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE. In ICML.
[11] H. MacLeod, S. Yang, et al. 2016. Identifying rare diseases from behavioural data: a machine learning approach. In CHASE. 130–139.
[12] K. Marek, D. Jennings, et al. 2011. The Parkinson Progression Marker Initiative (PPMI). Prog Neurobiol 95, 4 (2011), 629–635.
[13] P. Melville, M. Saar-Tsechansky, et al. 2004. Active feature-value acquisition for classifier induction. In ICDM. 483–486.
[14] P. Melville, M. Saar-Tsechansky, et al. 2005. An expected utility approach to active feature-value acquisition. In ICDM. 745–748.
[15] Feng Nan and Venkatesh Saligrama. 2017. Adaptive classification for prediction under a budget. In NIPS.
[16] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2015. Feature-budgeted random forest. In ICML.
[17] Feng Nan, Joseph Wang, and Venkatesh Saligrama. 2016. Pruning random forests for prediction on a budget. In NIPS.
[18] Feng Nan, Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2014. Fast margin-based cost-sensitive classification. In ICASSP.
[19] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag Radivojac. 2018. On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach. In IJCAI.
[20] S. Natarajan, A. Prabhakar, et al. 2017. Boosting for postpartum depression prediction. In CHASE. 232–240.
[21] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226–1238.
[22] Thomas Rückstieß, Christian Osendorfer, and Patrick van der Smagt. 2011. Sequential feature selection for classification. In Australasian Joint Conference on Artificial Intelligence. Springer, 132–141.
[23] M. Saar-Tsechansky, P. Melville, and F. Provost. 2009. Active feature-value acquisition. Manag Sci 55, 4 (2009).
[24] Victor S Sheng and Charles X Ling. 2006. Feature value acquisition in testing: a sequential batch test algorithm. In ICML.
[25] Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint active feature acquisition and classification with variable-size set encoding. In NIPS.
[26] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama. 2015. Efficient learning by directed acyclic graph for resource constrained prediction. In NIPS.
[27] Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. 2013. Cost-sensitive tree of classifiers. In ICML.
[28] Zhixiang Xu, Kilian Q Weinberger, and Olivier Chapelle. 2012. The greedy miser: learning under test-time budgets. In ICML.
[29] Z. Zheng and B. Padmanabhan. 2002. On active learning for data acquisition. In ICDM. 562–569.