<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>San Diego, California, USA, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Srijita Das</string-name>
          <email>Srijita.Das@utdallas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishabh Iyer</string-name>
          <email>Rishabh.Iyer@utdallas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sriraam Natarajan</string-name>
          <email>Sriraam.Natarajan@utdallas.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Texas at Dallas</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>24</volume>
      <issue>2020</issue>
      <abstract>
        <p>Motivated by clinical tasks where acquiring certain features such as FMRI or blood tests can be expensive, we address the problem of test-time elicitation of features. We formulate cost-aware feature elicitation as an optimization problem with a trade-off between performance and feature acquisition cost. Our experiments on three real-world medical tasks demonstrate the efficacy and effectiveness of our proposed approach in minimizing costs and maximizing performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Supervised learning → Budgeted learning; Feature
selection; • Applications → Healthcare.</p>
    </sec>
    <sec id="sec-2">
      <title>KEYWORDS</title>
      <p>cost sensitive learning, supervised learning, classification</p>
    </sec>
    <sec id="sec-3">
      <title>INTRODUCTION</title>
      <p>
        In the supervised classification setting, every instance has a fixed-length
feature vector, and a discriminative function is learnt on that feature
vector and its corresponding class variable. However, many practical
problems, such as healthcare, network domains, and survey questionnaire
design [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ], have an associated feature acquisition cost. In such domains,
there is a cost budget, and acquiring all the features of an instance
can be very costly. As a result, many cost-sensitive classifier models [
        <xref ref-type="bibr" rid="ref2 ref24 ref8">2, 8, 24</xref>
        ] have been proposed in the literature to incorporate the cost of
acquisition into the model objective during training and prediction.
      </p>
      <p>Our problem is motivated by such a cost-aware setting, where the
assumption is that prediction-time features have an acquisition cost and
must adhere to a strict budget. Consider a patient visiting a doctor for a
potential diagnosis of a disease. For such a patient, information like age,
gender, ethnicity and other demographic features is easily available at
zero cost. However, the various lab tests that the patient needs to undergo
incur costs. So, a trained model should be able to identify the most
relevant (i.e. most informative, yet least costly) lab tests required for
each specific patient. The intuition of this work is that different
patients, depending on their history, ethnicity, age and gender, may
require different tests for reasonably accurate prediction. We build on
the intuition that, given certain observed features such as one's
demographic details, the most important features for a patient depend on
the important features of similar patients. Based on this intuition, we
find similar data points in the observed feature space and identify the
important feature subsets of these similar instances by employing a greedy
information-theoretic feature selection objective.</p>
      <p>Our contributions in this work are as follows: (1) we formalize the
problem as a joint optimization problem of selecting the best feature
subset for similar data points and optimizing the loss function using
the important feature subsets; (2) we account for acquisition cost in
both the feature selector objective and the classifier objective to
balance the trade-off between acquisition cost and model performance;
(3) we empirically demonstrate the effectiveness of the proposed approach
on three real-world medical data sets.</p>
    </sec>
    <sec id="sec-4">
      <title>RELATED WORK</title>
      <p>
        The related work on cost-sensitive feature selection and learning
can be categorized into the following four broad approaches.
Tree-based budgeted learning: Prediction-time elicitation of
features under a cost budget has been widely studied in the literature.
Much work has been done in tree-based models [
        <xref ref-type="bibr" rid="ref16 ref17 ref26 ref27 ref28 ref5">5, 16, 17, 26–28</xref>
        ]
by adding a cost term to the objective function of either decision trees
or ensemble methods such as gradient boosted trees. All these methods aim
to build an adaptive, complex decision boundary by considering the
trade-off between performance and test-time feature acquisition cost.
While we are similar in motivation to these approaches, our methodology
differs in that we do not consider tree-based models. Instead, our
approach finds local feature subsets using an information-theoretic
feature selector for different clusters of training instances built in
a lower-dimensional space.
      </p>
      <p>
        Adaptive classification and dynamic feature discovery: Our
work also draws inspiration from Nan et al.'s work [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where they
learn a high-performance costly model and adaptively approximate its
performance by building a low-cost model together with a gating function
that decides which model to use for specific instances. This adaptive
switching between the low- and high-cost models handles the trade-off
between cost and performance. Our method differs from theirs because we
do not maintain a high-cost model, which is costly to build and difficult
to choose. We refine the parameters of a single low-cost model by
incorporating a cost penalty in the feature selector and model objectives.
Our work is also along the direction of Nan et al.'s work [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], where they select
varying feature subsets for test instances using neighbourhood
information from the training data. While calculating the neighborhood
information from training data is similar to building clusters in our
approach, the training neighborhood in our method is computed on just
the observed feature space. Moreover, we incorporate the neighbourhood
information into the training algorithm, whereas Nan et al.'s method is
a prediction-time algorithm. Ma et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also address
dynamic discovery of features, based on generative modelling and
Bayesian experimental design.
      </p>
      <p>
        Feature elicitation using Reinforcement learning: There is
another line of work in the sequential decision-making literature [
        <xref ref-type="bibr" rid="ref22 ref4 ref9">4, 9, 22</xref>
        ] that models test-time elicitation of features by
learning an optimal policy for test feature acquisition. Along this
direction, our work aligns with that of Shim et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], who
jointly train a classifier and an RL agent. Their classifier objective
is similar to ours, with a cost penalty; however, they use a deep RL
agent to learn the policy. We, on the other hand, use a localised
feature selector to find the important feature subsets for the
underlying training clusters in the observed feature space.
      </p>
      <p>
        Active Feature Acquisition: Our problem setup is also inspired
by work on active feature acquisition [
        <xref ref-type="bibr" rid="ref13 ref14 ref19 ref23 ref29">13, 14, 19, 23, 29</xref>
        ], where
certain feature subsets are observed and the rest are acquired at a cost.
While all the above-mentioned work follows this setup during training
time and typically uses active learning to seek informative instances at
every iteration, we use this particular setting for test instances.
Unlike their work, all the training instances in our work are fully
observed, and the assumption is that the feature acquisition cost has
already been paid during training. Also, we address a supervised
classification problem instead of an active learning setup. Our problem
setup is similar to Kanani et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as
they also have partial test instances; however, their problem is one of
instance acquisition, where the acquired feature subset is fixed. Our
method aims at discovering variable-length feature subsets for the
various underlying clusters.
      </p>
      <p>Our contributions: Although the problem of prediction-time
feature elicitation has been explored in the literature from various
directions and under various assumptions, we propose an intuitive
solution and formulate the problem in a two-step optimization framework.
We incorporate acquisition cost in both the feature selector and model
objectives to balance the trade-off between performance and cost. The
problem setup is naturally applicable to real-world health care and
other domains where knowledge of the observed features also needs to be
accounted for while selecting the elicitable features.</p>
    </sec>
    <sec id="sec-5">
      <title>COST AWARE FEATURE ELICITATION</title>
    </sec>
    <sec id="sec-6">
      <title>Problem setup</title>
      <p>Given: A dataset {(x_1, y_1), · · ·, (x_n, y_n)} with each
x_i ∈ R^d as the feature vector. Each feature f_j has an associated
cost c_j.</p>
      <p>Objective: Learn a discriminative model that is aware of the
feature costs and can balance the trade-off between feature acquisition
cost and model performance.</p>
      <p>We make the additional assumption that there is a subset of
features with zero cost. These could be, for example, demographic
information (e.g. age, gender, etc.) in a medical domain, which is easily
available or less cumbersome to obtain than other features. In other
words, we can partition the feature set F = O ∪ E, where O are the
zero-cost observed features and E are the elicitable features that can
be acquired at a cost. We also assume that the training data is
completely available with all features (i.e. the cost of all features
has already been paid). The goal is to use the observed features to find
similar instances in the training set and identify the important feature
subset for each of these clusters based on a feature selector objective
function that balances the trade-off between choosing the important
features and the cost at which they are acquired.</p>
    </sec>
    <sec id="sec-7">
      <title>Proposed solution</title>
      <p>As a first step, we cluster the training instances based on just
the observed zero-cost feature set O. The intuition is that instances
with similar observed features will also have similar characteristics in
terms of which elicitable features to order. For example, in a medical
application, whether to request a blood test or a CT scan will depend on
factors such as age, gender, ethnicity and whether patients with similar
demographic features required these tests. Also, since the feature set O
comes at zero cost, we assume that for unseen test instances this
feature set is observed.</p>
      <p>We propose a model that consists of a parameterized feature
selector module FS(θ_k, E_k, c), which takes in the set of input
instances E_k belonging to cluster C_k (formed on the feature set O)
and produces a subset S_k of the most important features for the
classification task. The feature selection model is based on an
information-theoretic objective function and is augmented with the
feature cost to account for the trade-off between model performance and
acquisition cost at test time. The output feature subset from the
feature selector module is used to update the parameters of the
classifier. The optimization framework is shown in Figure 1.</p>
      <sec id="sec-7-1">
        <title>Information theoretic Feature selector model</title>
        <p>
          The feature selector module selects the best subset of features
for each cluster of training data based on an information-theoretic
objective score. At test time, we do not know the elicitable feature
subset E (the goal of feature selection is, in the first place, to find
the truly necessary features for learning). Hence we propose to use the
set of training instances closest to the current instance. Since we
assume that the training data has already been elicited, all features
are observed in the training data. We compute this distance based on
just the observed feature set O. We cluster the training data based on
the observed features into m clusters C_1, C_2, · · ·, C_m. Next, we use
the Minimum-Redundancy-Maximum-Relevance (MRMR) feature selection
paradigm [
          <xref ref-type="bibr" rid="ref1 ref21">1, 21</xref>
          ]. We denote by [θ_1, θ_2, θ_3, θ_4] the parameters of a
particular cluster C_k. The feature selector for cluster C_k is then
        </p>
        <p>
          FS(θ_k, E, c) = Σ_{X_e ∈ E} [ θ_1 I(X_e; Y) − θ_2 Σ_{X_s ∈ S} I(X_e; X_s) + θ_3 Σ_{X_s ∈ S} I(X_e; X_s | Y) − θ_4 c(X_e) ]   (1)
        </p>
        <p>
          where the four terms are, respectively, the max. relevance,
min. redundancy, conditional redundancy and cost penalty terms, and
I(X_e; Y) is the mutual information between the random variable X_e
(feature) and Y (target). In the above equation, the feature subset S is
grown greedily, maximizing the above objective function. In Equation 1,
X_e denotes a single feature from the elicitable set E that is
considered for evaluation given the subset S grown so far. The first
term is the mutual information between each feature and the class
variable Y; in a discriminative task, this value should be maximized.
The second term is the pairwise mutual information between each
candidate feature and the features already added to the feature subset
S; this value needs to be minimized to select informative features. The
third term is called the conditional redundancy [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and this term needs to be maximized. The last term adds a
penalty for the cost of every feature and ensures the right trade-off
between cost, relevance and redundancy. In this work, we do not learn
the parameters θ for each cluster; instead we fix them to 1. We leave
the learning of these parameters to future work.
        </p>
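<p>To make the greedy selection concrete, here is a minimal, hypothetical Python sketch of a cost-penalized MRMR step in the spirit of Equation 1, assuming discrete features. All names (mi, cmi, cafe_select) are ours, not the paper's, and the mutual-information estimates are simple plug-in counts rather than the implementation the authors used.</p>

```python
# Hypothetical sketch of the cost-penalized greedy MRMR selector (Equation 1),
# assuming discrete features. mi/cmi are plug-in mutual-information estimators.
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) for discrete sequences (in nats)."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def cmi(xs, ys, zs):
    """Conditional mutual information I(X; Y | Z) = I(X; (Y, Z)) - I(X; Z)."""
    return mi(xs, list(zip(ys, zs))) - mi(xs, zs)

def cafe_select(columns, target, costs, budget, theta=(1, 1, 1, 1)):
    """Greedily grow a subset S, scoring each candidate e by
    th1*I(e;Y) - th2*sum_s I(e;s) + th3*sum_s I(e;s|Y) - th4*cost(e);
    stop when no candidate scores > 0 or the budget would be exceeded."""
    th1, th2, th3, th4 = theta
    selected, spent = [], 0.0
    remaining = sorted(columns)          # sorted for a deterministic tie-break
    while remaining:
        def score(e):
            s = th1 * mi(columns[e], target) - th4 * costs[e]
            for f in selected:
                s += th3 * cmi(columns[e], columns[f], target) \
                     - th2 * mi(columns[e], columns[f])
            return s
        best = max(remaining, key=score)
        if score(best) <= 0 or spent + costs[best] > budget:
            break
        selected.append(best)
        spent += costs[best]
        remaining.remove(best)
    return selected
```

With two identical informative columns, the redundancy term rejects the duplicate: the first copy is selected, and the second then scores at or below zero.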
        <p>In the problem setup, since the zero-cost feature subset is
always present, we always consider the observed feature subset O in
addition to the most important feature subset returned by the feature
selector objective. We also account for knowledge of the observed
features while growing the informative feature subset through greedy
optimization. Specifically, while calculating the pairwise mutual
information between features and the conditional redundancy term (the
second and third terms of Equation 1), we also evaluate the mutual
information of the candidate features with these observed features. It
is to be noted that, in cases where the observed features are not
discriminative enough of the target, the feature selector module ensures
that the elicitable features with maximum relevance to the target
variable are picked.</p>
        <p>Optimization Problem: The cost-aware feature selector
FS(θ_k, E_k, c) for a given set of instances E_k belonging to a specific
cluster C_k solves the following optimization problem:</p>
        <p>S_k = argmax_{S ⊆ E} FS(θ_k, S, c)   (2)</p>
        <p>For a given instance (x, y), we denote by L(x, y, S_k, Θ) the
loss function using the subset S_k of the features obtained from the
feature selector optimization problem. The optimization problem for
learning the parameters of the classifier can then be posed as:
min_Θ Σ_{i=1}^{n} L(x_i, y_i, S_{k_i}, Θ) + λ_1 c(S_{k_i}) + λ_2 ||Θ||^2   (3)
where λ_1 and λ_2 are hyper-parameters and k_i denotes the cluster to
which instance x_i belongs. In the above equation, Θ is the parameter of
the model and can be updated by standard gradient-based techniques. This
loss function takes into account the important feature subset for each
cluster and updates the parameters accordingly. The classifier objective
also contains a cost term, denoted c(S_k), to account for the cost of
the selected feature subset. For a hard budget on the number of elicited
features, the cost component in the model objective can be considered.
In the case of a cost budget, this component can be ignored because the
elicited feature subset adheres to a fixed cost, and hence this term is
constant.</p>
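<p>A minimal sketch of the classifier objective in Equation 3 for one cluster, under stated assumptions: the paper uses an SVM, but we substitute a linear model with logistic loss here purely to keep the gradient short, and all names (masked, objective, fit) are ours. Features outside the cluster's subset S_k are zero-imputed, mirroring the plain CAFE variant described later.</p>

```python
# Hypothetical sketch of Equation (3) for a linear model with logistic loss
# (stand-in for the paper's SVM). Unselected features are zero-imputed.
import numpy as np

def masked(X, subset, d):
    """Keep only the columns in `subset`; zero-impute the rest."""
    M = np.zeros((X.shape[0], d))
    M[:, subset] = X[:, subset]
    return M

def objective(theta, X, y, subset, cost, lam1, lam2):
    """sum_i L(x_i, y_i, S_k, theta) + lam1 * c(S_k) + lam2 * ||theta||^2."""
    Z = masked(X, subset, X.shape[1])
    margins = y * (Z @ theta)                    # labels y in {-1, +1}
    return np.log1p(np.exp(-margins)).sum() + lam1 * cost + lam2 * theta @ theta

def fit(X, y, subset, cost, lam1=1.0, lam2=0.1, lr=0.1, steps=200):
    """Plain gradient descent on the objective above."""
    d = X.shape[1]
    theta = np.zeros(d)
    Z = masked(X, subset, d)
    for _ in range(steps):
        margins = y * (Z @ theta)
        # d/dtheta of the logistic loss plus the L2 penalty
        grad = -(Z * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)
        theta -= lr * (grad + 2.0 * lam2 * theta)
    return theta
```

Note that the λ_1 c(S_k) term is constant once S_k is fixed, so it shifts the objective value without affecting the gradient, exactly as the text observes for the cost-budget case.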
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Algorithm</title>
      <sec id="sec-8-1">
        <title>Cost Aware Feature Elicitation (CAFE)</title>
        <p>We present the algorithm for Cost Aware Feature Elicitation
(CAFE) in Algorithm 1. CAFE takes as input a set of training examples E,
the zero-cost feature set O, the elicitable feature set E, a cost vector
c ∈ R^d and a budget B. Each element in the training set E consists of a
tuple (x, y), where x ∈ R^d is the feature vector and y is the label.</p>
        <p>The training instances E are clustered based on just the
observed feature set O using K-means clustering (Cluster). For every
cluster C_k, the training instances belonging to the cluster are
assigned to the set E_k and passed to the feature selector module
(lines 6-8). The FeatureSelector function takes E_k, the parameters θ_k,
the feature subsets O and E, the cost vector c and a predefined budget B
as input, and returns the most important feature subset X_k
corresponding to cluster C_k. A greedy optimization technique is used to
grow the feature subset of every cluster based on the feature selector
objective function defined in Equation 1. The FeatureSelector terminates
once the budget B is exhausted or the mutual information score becomes
negative. Once the important feature subsets are obtained for all |C|
clusters, the model objective function in Equation 3 is optimized for
all the training instances using the important feature subsets of the
clusters to which they belong (lines 12-18). All the remaining features
are imputed with either 0 or any other imputation model before training.
The final trained model G(E_{O∪X}, Θ, λ) is a unified model used to make
predictions for a test instance consisting of just the observed feature
subset O.</p>
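<p>The training loop above can be sketched end to end in Python. This is a hypothetical, simplified rendering: the tiny Lloyd's k-means and the cafe_train name are ours, and the per-cluster selector is a stand-in that ranks elicitable columns by absolute correlation with the label rather than the MRMR objective of Equation 1, kept only to make the sketch self-contained.</p>

```python
# Hypothetical end-to-end sketch of Algorithm 1: cluster on observed columns,
# pick a per-cluster elicitable subset under the budget, zero-impute the rest.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Tiny Lloyd's k-means; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def cafe_train(X, y, observed, elicitable, costs, budget, k=2):
    """Cluster on the observed columns, then per cluster greedily add the
    best-ranked elicitable features that fit the budget. Returns
    {cluster: (subset, X_masked)} ready for classifier training."""
    labels = kmeans(X[:, observed], k)
    out = {}
    for c in range(k):
        idx = labels == c
        # stand-in selector: rank by |corr(feature, y)| within the cluster
        ranked = sorted(elicitable,
                        key=lambda f: -abs(np.corrcoef(X[idx, f], y[idx])[0, 1]))
        subset, spent = [], 0.0
        for f in ranked:
            if spent + costs[f] <= budget:
                subset.append(f)
                spent += costs[f]
        keep = observed + subset
        Xc = np.zeros_like(X[idx])        # zero-impute unselected columns
        Xc[:, keep] = X[idx][:, keep]
        out[c] = (sorted(subset), Xc)
    return out
```

On a toy data set where the label tracks column 1 in one cluster and column 2 in the other, the two clusters receive different elicitable subsets, which is the behavior the algorithm is designed to produce.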
      </sec>
    </sec>
    <sec id="sec-9">
      <title>EMPIRICAL EVALUATION</title>
      <p>We experimented with three real-world medical data sets. The
intuition behind CAFE fits medical domains particularly well, hence our
choice of data sets; however, the idea can be applied to other domains
ranging from logistics to resource allocation tasks. Table 2 summarizes
the characteristics of the data sets used in our experiments. Below are
the details of the three data sets we use.</p>
      <sec id="sec-9-1">
        <title>1. Parkinson’s disease prediction: The Parkinson’s Progression</title>
        <p>
          Marker Initiative (PPMI) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is an observational study whose aim is to identify Parkinson's
disease progression from various types of features. The PPMI data set
consists of features related to various motor functions as well as
non-motor behavioral and psychological tests. We consider certain motor
assessment features, such as rising from a chair, gait, freezing of
gait, posture and postural stability, as observed features, and all
remaining features as elicitable features that must be acquired at a
cost.
        </p>
        <p>
          Algorithm 1 Cost Aware Feature Elicitation
1: function CAFE(E, O, E, c, B)
2:   E = E_{O∪E}           ⊲ E consists of 0-cost features O and costly features E
3:   C = Cluster(E_O)      ⊲ Clustering based on the observed features O
4:   X = {∅}               ⊲ Stores best feature subsets of each cluster
5:   for k = 1 to |C| do   ⊲ Repeat for every cluster
6:     E_k = GetClusterMember(E, C, k)
7:       ⊲ Get the data points belonging to each cluster k
8:     X_k = FeatureSelector(E_k, θ_k, O, E, c, B)
9:       ⊲ Parameterized feature selector for each cluster
10:    X = X ∪ {X_k ∪ O}
11:  end for
12:  for k = 1 to |C| do   ⊲ Repeat for every cluster
13:    X_k = GetFeatureSubset(X, k)
14:      ⊲ Get the feature subset for each cluster k
15:    for i = 1 to |E_k| do ⊲ Repeat for every data point in cluster k
16:      Optimize L(x_i, y_i, X_k, Θ, λ)
17:        ⊲ Optimize the objective function in Equation 3
18:      Update Θ          ⊲ Update the model parameter Θ
19:    end for
20:  end for
     return G(E_{O∪X}, Θ, λ) ⊲ G is the training model built on E
21: end function
        </p>
        <p>
          2. Alzheimer's disease prediction: The Alzheimer's Disease
Neuroimaging Initiative (ADNI1) is a study that aims to test whether
various clinical, FMRI and biomarker features can be used to predict the
early onset of Alzheimer's disease. In this data set, we consider the
demographics of the patients as observed, zero-cost features, and the
FMRI image data and cognitive score data as unobserved, elicitable
features.
        </p>
        <p>
          3. Rare disease prediction: This data set is created from survey
questionnaires [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and the task is to predict whether a person has a rare
disease. The demographic features are observed, while other sensitive
survey questions regarding technology use, health and disease-related
meta information are considered elicitable.
        </p>
        <p>
          Evaluation Methodology: All the data sets were partitioned into
an 80:20 train-test split. Hyper-parameters, such as the number of
clusters on the observed features, were picked by 5-fold
cross-validation on all the data sets. The optimal numbers of clusters
were 6 for ADNI, 9 for the Rare disease data set and 7 for the PPMI data
set. For the results reported in Table 1, we considered a hard budget on
the number of elicitable features and set it to half the total number of
features in the respective data set. We use K-means clustering as the
underlying clustering algorithm. For all reported results, we use an
underlying Support Vector Machine [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
classifier with a radial basis kernel function. Since all the data sets
are highly imbalanced, we consider metrics like recall, F1, AUC-ROC and
precision for our reported results. For the feature selector module, we
used and built upon the existing implementation of Li et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
We consider two variants of CAFE: (1) CAFE, in which we replace the
missing and unimportant features of every cluster with 0 and then update
the classifier parameters, and (2) CAFE-I, where we replace the missing
and unimportant features using an imputation model learnt from the
already acquired feature values of other data points. A simple
imputation model is used, where we replace the missing features with the
mode for categorical features and the mean for numeric features.
1 www.loni.ucla.edu/ADNI
        </p>
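<p>The CAFE-I imputation step described above (mode for categorical columns, mean for numeric ones, computed from the acquired values) can be sketched in a few lines; the impute name and its signature are ours, shown purely as a minimal illustration.</p>

```python
# Hypothetical sketch of the CAFE-I imputation step: unacquired values are
# filled with the column mode (categorical) or mean (numeric) computed from
# the training points whose values were acquired. Column typing is given.
from statistics import mean, mode

def impute(rows, missing=None, categorical=()):
    """rows: list of equal-length lists; `missing` marks unacquired values;
    `categorical` holds the indices of categorical columns."""
    cols = list(zip(*rows))
    fills = [mode([v for v in col if v is not missing]) if j in categorical
             else mean([v for v in col if v is not missing])
             for j, col in enumerate(cols)]
    return [[fills[j] if v is missing else v for j, v in enumerate(row)]
            for row in rows]
```

The plain CAFE variant corresponds to replacing every fill value with 0 instead.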
        <p>Baselines: We consider three baselines for evaluating CAFE and
CAFE-I: (1) using the observed, zero-cost features to update the
training model, denoted OBS; (2) using a random subset of a fixed number
of elicitable features together with all the observed features to update
the training model, denoted RANDOM (for this baseline, the results are
averaged over 10 runs); and (3) using the information-theoretic feature
selector score defined in Equation 1 to select the 'k' best elicitable
features on the entire data, without any cluster consideration, along
with the observed features, denoted KBEST. We keep the value of 'k' the
same as that used by CAFE. Although some existing methods could be
potential baselines, none of them match the exact setting of our
problem, so we do not compare our method against them.</p>
        <p>Results: We aim to answer the following questions:
Q1: How do CAFE and CAFE-I with a hard budget on features compare
against the standard baselines?
Q2: How do the cost-sensitive versions of CAFE and CAFE-I fare against
the cost-sensitive baseline KBEST?</p>
        <p>The results reported in Table 1 suggest that both CAFE and
CAFE-I significantly outperform the other baselines in almost all
metrics for the Rare disease and PPMI data sets. For ADNI, CAFE and
CAFE-I outperform the other baselines in the clinically relevant recall
metric, while KBEST performs best for the other metrics. The reason is
that in ADNI the elicitable features are image features, and we
discretize the image features to calculate the information gain for the
feature selector module; the granular feature information is lost in
this discretization, hence the drop in performance. For the experiments
in Table 1, we keep the budget at approximately half the total number of
features for all the methods. On average, CAFE-I performs better than
CAFE across all the data sets because the underlying imputation model
treats the missing values better than replacing all of them with 0. This
answers Q1 affirmatively.</p>
        <p>In Figure 3, we compare the cost versions of CAFE and CAFE-I
against KBEST. The cost version takes into account the cost of
individual features as a penalty in the feature selector module. Hence,
in this version of CAFE, a cost budget is used as opposed to a hard
budget on the number of elicitable features. We generate the cost vector
by sampling each cost component uniformly from (0, 1). For PPMI and Rare
disease, cost-sensitive CAFE performs consistently better than KBEST
with increasing cost budget. In the PPMI data set, the greedy
optimization of the feature selector objective on the entire data set
led to the elicitation of just one feature (beyond that, the information
gain was negative), hence KBEST's performance on PPMI remains the same
across cost budgets. CAFE, on the other hand, was able to select
important feature subsets for the various clusters based on the observed
features related to gait and posture. For the ADNI data set, CAFE
performs better than KBEST only in recall, for the same reason mentioned
above. This answers Q2 affirmatively.</p>
        <p>Lastly, Figure 2 shows the effect of increasing the number of
clusters on the validation recall for the Rare disease data set. As can
be seen, for small numbers of clusters the recall is very low, and it
increases to an optimum at 9 clusters. This supports the view that
forming clusters based on the observed features helps CAFE select
different feature subsets for different clusters, thus aiding the
learning procedure.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>In this paper, we pose the prediction-time feature elicitation
problem as an optimization problem, employing a cluster-specific feature
selector to choose the best feature subset and then optimizing the
training loss. We show the effectiveness of our approach on real data
sets where the problem setup is intuitive. Future work includes learning
the parameters of the feature selector module, jointly optimizing the
feature selector and model parameters for a more robust framework, and
adding more constraints to the optimization.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGEMENTS</title>
      <p>SN &amp; SD gratefully acknowledge the support of NSF grant
IIS-1836565. Any opinions, findings and conclusions or recommendations
are those of the authors and do not necessarily reflect the view of the
US government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Gavin</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Adam</given-names>
            <surname>Pocock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Jie</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mikel</given-names>
            <surname>Luján</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Conditional likelihood maximisation: a unifying framework for information theoretic feature selection</article-title>
          .
          <source>JMLR</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Xiaoyong</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Deng</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiang Yang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Charles</surname>
            <given-names>X</given-names>
          </string-name>
          <string-name>
            <surname>Ling</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Test-cost sensitive naive bayes classification</article-title>
          .
          <source>In ICDM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Corinna</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning</source>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Gabriel</given-names>
            <surname>Dulac-Arnold</surname>
          </string-name>
          , Ludovic Denoyer, Philippe Preux, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Datum-wise classification: a sequential approach to sparsity</article-title>
          .
          <source>In ECML PKDD</source>
          .
          <volume>375</volume>
          -
          <fpage>390</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tianshi</given-names>
            <surname>Gao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Koller</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Active classification based on value of classifier</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanani</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Prediction-time active feature-value acquisition for cost-effective customer targeting</article-title>
          .
          <source>Workshop on Cost Sensitive Learning at NIPS</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jundong</given-names>
            <surname>Li</surname>
          </string-name>
          , Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu.
          <year>2018</year>
          .
          <article-title>Feature selection: A data perspective</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Charles X.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianning</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shichao</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Decision trees with minimal costs</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Lizotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Madani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Greiner</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Budgeted learning of Naive-Bayes classifiers</article-title>
          .
          <source>In UAI</source>
          .
          <fpage>378</fpage>
          -
          <lpage>385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Chao</given-names>
            <surname>Ma</surname>
          </string-name>
          , Sebastian Tschiatschek, Konstantina Palla, José Miguel Hernández-Lobato, Sebastian Nowozin, and Cheng Zhang.
          <year>2019</year>
          .
          <article-title>EDDI: Efficient Dynamic Discovery of High-Value Information with Partial VAE</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>MacLeod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Identifying rare diseases from behavioural data: a machine learning approach</article-title>
          .
          <source>In CHASE</source>
          .
          <fpage>130</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Marek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jennings</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>The Parkinson Progression Marker Initiative (PPMI)</article-title>
          .
          <source>Prog Neurobiol</source>
          <volume>95</volume>
          ,
          <issue>4</issue>
          (
          <year>2011</year>
          ),
          <fpage>629</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saar-Tsechansky</surname>
          </string-name>
          , et al.
          <year>2004</year>
          .
          <article-title>Active feature-value acquisition for classifier induction</article-title>
          .
          <source>In ICDM</source>
          .
          <fpage>483</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saar-Tsechansky</surname>
          </string-name>
          , et al.
          <year>2005</year>
          .
          <article-title>An expected utility approach to active feature-value acquisition</article-title>
          .
          <source>In ICDM</source>
          .
          <fpage>745</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Nan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Saligrama</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adaptive classification for prediction under a budget</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Saligrama</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Feature-budgeted random forest</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Saligrama</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Pruning random forests for prediction on a budget</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Feng</given-names>
            <surname>Nan</surname>
          </string-name>
          , Joseph Wang,
          <string-name>
            <given-names>Kirill</given-names>
            <surname>Trapeznikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Saligrama</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Fast margin-based cost-sensitive classification</article-title>
          .
          <source>In ICASSP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sriraam</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Srijita</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nandini</given-names>
            <surname>Ramanan</surname>
          </string-name>
          , Gautam Kunapuli, and
          <string-name>
            <given-names>Predrag</given-names>
            <surname>Radivojac</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach</article-title>
          .
          <source>In IJCAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prabhakar</surname>
          </string-name>
          , et al.
          <year>2017</year>
          .
          <article-title>Boosting for postpartum depression prediction</article-title>
          .
          <source>In CHASE</source>
          .
          <fpage>232</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Hanchuan</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fuhui</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>27</volume>
          ,
          <issue>8</issue>
          (
          <year>2005</year>
          ),
          <fpage>1226</fpage>
          -
          <lpage>1238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Rückstieß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Osendorfer</surname>
          </string-name>
          , and Patrick van der Smagt.
          <year>2011</year>
          .
          <article-title>Sequential feature selection for classification</article-title>
          .
          <source>In Australasian Joint Conference on Artificial Intelligence</source>
          . Springer,
          <fpage>132</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saar-Tsechansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Provost</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Active feature-value acquisition</article-title>
          .
          <source>Manag Sci</source>
          <volume>55</volume>
          ,
          <issue>4</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Victor S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Charles X.</given-names>
            <surname>Ling</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Feature value acquisition in testing: a sequential batch test algorithm</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Hajin</given-names>
            <surname>Shim</surname>
          </string-name>
          , Sung Ju Hwang, and
          <string-name>
            <given-names>Eunho</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Joint active feature acquisition and classification with variable-size set encoding</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kirill</given-names>
            <surname>Trapeznikov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Venkatesh</given-names>
            <surname>Saligrama</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Efficient learning by directed acyclic graph for resource constrained prediction</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Zhixiang</given-names>
            <surname>Xu</surname>
          </string-name>
          , Matt Kusner, Kilian Weinberger, and
          <string-name>
            <given-names>Minmin</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Cost-sensitive tree of classifiers</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Zhixiang</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kilian Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Chapelle</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The greedy miser: learning under test-time budgets</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Padmanabhan</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>On active learning for data acquisition</article-title>
          .
          <source>In ICDM</source>
          .
          <fpage>562</fpage>
          -
          <lpage>569</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>