<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">From Must to May: Enabling Test-Time Feature Imputation and Interventions</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Evan</forename><surname>Rex</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">University of Cambridge</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mateo</forename><forename type="middle">Espinosa</forename><surname>Zarlenga</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">University of Cambridge</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrei</forename><surname>Margeloiu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">University of Cambridge</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mateja</forename><surname>Jamnik</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Technology</orgName>
								<orgName type="institution">University of Cambridge</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">From Must to May: Enabling Test-Time Feature Imputation and Interventions</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C6F1FDD7B55FEA7E8A684BB4E5BFC035</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Test-time interventions</term>
					<term>Missing value imputation</term>
					<term>Feature selection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Interpretable machine learning models can be improved by correcting mispredicted intermediate steps via test-time interventions on their intermediate predictions. Methods that jointly learn to impute missing features and predict a downstream task can benefit from such interventions. However, determining which features to prioritise for intervention remains a challenge. To address this, we propose F-Act, a novel method employing feature selection to adaptively manage feature availability at test-time. Our approach achieves this by combining in-model imputation and test-time interventions on intermediate predictions to avoid the need for model retraining. Furthermore, F-Act can recommend which features to prioritise when collecting data, a key property when optimising performance in resource-limited environments. Our empirical analysis shows F-Act performs competitively with or better than previous baselines in inference tasks with missing features when incorporating feature collection recommendations. Additionally, we show F-Act can incorporate missing feature values through test-time interventions, improving predictive performance without retraining across tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Machine learning models for tabular datasets typically expect a complete feature set during both training and inference. However, in practice, features are often missing during inference due to the high cost and difficulty of obtaining the complete feature set for some samples (e.g., gene expression counts) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Such test-time feature unavailability necessitates models that can make accurate predictions with incomplete feature sets. Additionally, since acquiring new features may be prohibitively expensive, it is crucial for these models to offer recommendations on which missing feature values to collect to maximise their impact on the model's accuracy.</p><p>Current strategies addressing limited feature availability typically involve either: (i) imputing, or predicting, missing features at test-time <ref type="bibr" target="#b0">[1]</ref>, or (ii) selecting a minimal feature subset on which the model is retrained <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. Although these methods are practical, they have clear limitations in scenarios with variable feature availability: feature selection identifies critical features but cannot adapt to changes in feature availability, while imputation provides flexibility but lacks guidance for users on prioritising features.</p><p>In this paper, we address this gap by introducing F-Act (Feature-wise Active adaptation), a method that combines feature selection with imputation to enable adaptation to variable feature availability without retraining, all while maintaining high predictive accuracy. F-Act achieves this by, first, imputing missing features at test-time and, second, enabling new features to be incorporated through test-time interventions, where F-Act's intermediate prediction space is modified to incorporate the presence of a new feature. 
Technically, F-Act employs differentiable mask sampling and feature reconstruction to learn to operate optimally from an incomplete set of features. This design enables F-Act to advise on the order in which features should be collected to maximise their impact, permitting deployment in resource-constrained settings. Using real and synthetic datasets, we evaluate F-Act and find that it matches benchmarks' performance in imputation, feature selection, and prediction, providing recommendations that enhance model performance through adaptive feature incorporation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>Imputation and Feature Selection Our work incorporates both feature imputation and feature selection to address limited test-time feature availability. As such, our work is placed at the intersection of these two research subfields. Previous works in feature imputation can be divided into auxiliary-model-based approaches and joint learning approaches. In this context, auxiliary-model-based approaches pair prediction models with separate imputation methods <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>, while joint learning approaches integrate prediction and imputation in an end-to-end model <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. Nevertheless, we emphasise that previously proposed imputation techniques lack feature prioritisation for collection, leading to uncertainty over which features one should prioritise when deploying the model in a setup with varying feature availability. This is a key gap we aim to address with this paper.</p><p>Feature selection techniques <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>, in contrast, deal with potential feature unavailability (or redundancy) by learning to select a subset of features from which a task can be accurately solved. These approaches commonly achieve this by learning a feature importance ranking that can then inform which subset of features one should select. A shortcoming of these approaches, however, is their inflexibility to a varying set of input features, as, once features have been selected, they require a fixed subset of features to train the downstream model <ref type="bibr" target="#b13">[14]</ref>. 
As such, in this work we combine feature imputation with feature selection to enable easy adaptability from a core set of initially selected features. We note that combining feature selection and imputation has been previously explored <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b2">3]</ref>. However, performing feature selection with joint learning for inference and imputation in a single end-to-end architecture is novel. This is worth exploring, as Bertsimas et al. <ref type="bibr" target="#b7">[8]</ref> and Le Morvan et al. <ref type="bibr" target="#b8">[9]</ref> note that joint learning for imputation and inference can yield improved results.</p><p>Relation to Active Feature Acquisition Active Feature Acquisition (AFA) involves learning a policy for collecting new features at test-time such that a model's accuracy is maximised after observing a small set of features <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref>. As we are interested in providing feature collection recommendations, our work is highly related to AFA. Nevertheless, we distinguish ourselves from traditional AFA approaches in two key ways. First, we provide feature collection recommendations at a global level rather than at a local, per-sample level. Second, we learn to both select a subset of features and impute missing features in an end-to-end fashion, enabling missing features to be predicted at test-time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Test-time interventions</head><p>Figure <ref type="figure" target="#fig_3">1</ref>: F-Act. Given a sample x, we apply an element-wise mask m soft ∈ [0, 1] 𝑑 , which acts as a sparsity-inducing global feature selection mask, eliminating noisy features and resulting in x ˜. We then apply an element-wise mask m hard ∈ {0, 1} 𝑑 , sampled from a Gumbel-Softmax distribution with learnable probabilities 𝜋 hard . The masked features are reconstructed by the Reconstruction Module, 𝑓 𝑅 , to approximate x ˜. Subsequently, we perform test-time interventions on this reconstructed sample by re-incorporating 𝑘 previously masked features according to a greedy policy, 𝜇, derived from the rankings of 𝜋 hard , resulting in x ˜𝐼 . Finally, the Prediction Module, 𝑓 𝑃 , predicts the sample's label.</p></div>
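To make the pipeline in Figure 1 concrete, the following is a minimal NumPy sketch of one forward pass with a test-time intervention. The gate values `pi_soft` and `pi_hard` and the mean-filling `f_R` are illustrative stand-ins of our own choosing, not the learned modules of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)                           # a test-time sample

# Illustrative "learned" parameters (in F-Act these are trained end-to-end)
pi_soft = np.array([0.9, 0.8, 0.7, 0.05, 0.6])   # soft-mask gates, m_soft
pi_hard = np.array([0.95, 0.9, 0.2, 0.5, 0.1])   # hard-mask keep probabilities

x_tilde = pi_soft * x                            # soft mask down-weights noisy features
m_hard = (pi_hard >= 0.5).astype(float)          # temperature 0: threshold at 0.5
x_s = m_hard * x_tilde                           # truncated feature space

def f_R(z):
    """Stand-in reconstruction module: fills masked slots with the mean
    of the observed entries (the real f_R is a learned network)."""
    observed_mean = z[m_hard == 1].mean()
    return np.where(m_hard == 1, z, observed_mean)

x_r = f_R(x_s)

# Test-time intervention: re-incorporate the top-k masked features,
# ranked by their hard-mask probabilities (the greedy policy mu)
k = 1
masked = np.flatnonzero(m_hard == 0)
order = masked[np.argsort(-pi_hard[masked])]
x_i = x_r.copy()
x_i[order[:k]] = x_tilde[order[:k]]              # replace reconstruction with the true value
```

The final prediction would then be `f_P(x_i)`; the key point is that an intervention simply overwrites a reconstructed entry with the feature's (soft-masked) true value.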
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Relation to Human Interpretable Artificial Intelligence</head><p>Human Interpretable Artificial Intelligence (HI-AI) refers to AI systems designed to ensure their decisions and workings are understandable and transparent to humans. Our work is related to HI-AI methods as it enables (1) reconstruction of missing features through test-time imputations, providing insights into a model's understanding of how missing features relate to provided ones, (2) construction of feature importance rankings through its feature collection recommendations, and (3) test-time interventions, where users can provide previously missing features by intervening on F-Act's intermediate predictions using these features' values. As such, our work is related to previous interpretable imputation techniques <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b6">7]</ref> and methods in the concept-based explainable AI literature <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26]</ref> that provide test-time feedback to models via human-aligned concepts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Feature-wise Active adaptation</head><p>We present a joint learning framework for inference and missing value imputation that offers users insights into which missing feature values should be prioritised for collection and intervention. Formally, our goal is to learn a predictor 𝑓 𝜃 , parameterised by 𝜃, which can operate on any subset of features 𝑆 ⊆ ℱ. Concurrently, the predictor 𝑓 𝜃 should also suggest which missing feature values to collect for intervention at test-time, prioritised by their importance to improve the predictor's performance. We achieve this by introducing F-Act (Figure <ref type="figure" target="#fig_3">1</ref>), a method for the joint learning of feature selection, missing value imputation, and prediction. Our architecture comprises three modules: (i) a Mask module that facilitates feature selection, (ii) a Reconstruction module for feature imputation, and (iii) a Prediction module for making predictions.</p><p>The Mask module serves three objectives: (i) global feature selection to eliminate irrelevant features, (ii) learning feature importance rankings to provide recommendations for feature collection, and (iii) simulating a missing feature scenario to train the Reconstruction module. We achieve all this functionality through hierarchical masking, first employing a soft mask m soft ∈ [0, 1] 𝑑 for feature selection and then a hard mask m hard ∈ {0, 1} 𝑑 to simulate missing features. The hard mask is sampled from a Gumbel-Softmax distribution <ref type="bibr" target="#b26">[27]</ref> with learnable probabilities 𝜋 hard .</p><p>To enable predictions from any truncated feature space and facilitate test-time interventions, we use the Reconstruction Module to reconstruct features from the truncated feature space 𝒳 ˜𝑆 generated by the hard mask. The Reconstruction Module outputs "reconstructed" samples, containing values for all features, even those missing from the original input. 
The reconstructed samples are then processed by the Prediction Module, which maps from the complete feature space 𝒳 ˜ to the predictions 𝒴. This setup ensures the model can make predictions even in the presence of missing features.</p><p>To provide feature collection recommendations, we define a greedy intervention policy that corrects reconstructed data based on feature selection probabilities from the mask module. This is the same approach used by feature importance-based selection methods <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b11">12]</ref>.</p><p>In order to jointly learn to perform feature selection, missing feature imputation, and downstream task prediction, we train our model using a composite loss function ℒ = ℒ P + 𝛼 𝑆 ℒ S + 𝛼 𝑅 ℒ R , where 𝛼 𝑆 and 𝛼 𝑅 are hyperparameters controlling how much we value feature selection (i.e., ℒ S ) and feature reconstruction (i.e., ℒ R ) over task accuracy (i.e., ℒ P ).</p><p>To encourage our model to perform a sparse feature selection, we follow previous works <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b3">4]</ref> and let ℒ S be the ℓ 1 norm of the soft and hard learnable mask probabilities:</p><formula xml:id="formula_1">ℒ S := 𝑑 ∑︁ 𝑖=1 (︁ 𝜋 soft 𝑖 + 𝜋 hard 𝑖 )︁</formula><p>In contrast, to encourage accurate imputation of masked features, we include a reconstruction loss term ℒ R that minimises the ℓ 2 norm of the difference between reconstructed/imputed feature values and ground truth feature values:</p><formula xml:id="formula_2">ℒ R := 1 |ℱ ∖ 𝑆| ∑︁ 𝑖∈ℱ ∖𝑆 (x ˜𝑖 − 𝑓 𝑅 𝑖 (m hard ⊙ x ˜; 𝜃 𝑅 )) 2</formula><p>where 𝑆 is the set of features selected by the Mask module. 
This loss encourages our model to learn to select a set of core features from which other, dependent features may be easily imputed.</p><p>Finally, to enable our model to predict downstream tasks both with and without feature interventions, we follow the work in <ref type="bibr" target="#b24">[25]</ref> and define our prediction loss, ℒ P , as</p><formula xml:id="formula_3">ℒ P := 𝐿 pred (𝑓 𝜃 (x; 𝜃, 0), X, Y) + 𝜔 𝑘max 𝐿 pred (𝑓 𝜃 (x; 𝜃, 𝑘 max ), X, Y)</formula><p>Here, 𝑓 𝜃 (x; 𝜃, 𝑖) represents the output of the task predictor for sample x when the top-𝑖 dependent features are intervened on, and 𝑘 max is the maximum number of interventions one may perform (i.e., the number of dependent features |ℱ ∖ 𝑆|). We clarify that, in this context, an intervention for a feature x 𝑗 involves setting the 𝑗-th feature of the reconstructed features x 𝑅 to x ˜𝑗. This loss, therefore, encourages the model to minimise a task-specific loss (e.g., cross-entropy) before and after interventions, with higher penalties incurred when a mistake is made after a higher number of features have been intervened on at train time (controlled by a hyperparameter 𝜔 &gt; 1). Inference By thresholding mask probabilities, we can identify core necessary features and recommend which features to prioritise for collection. During inference, F-Act imputes missing features dynamically and allows the re-incorporation of non-selected features in the form of test-time interventions. In practice, given an incomplete sample at test-time, we replace the missing values with 0. To perform feature selection, imputation and prediction, we apply the mask, reconstruction and prediction modules in order. A change from the training procedure is that, at test-time, the Gumbel-Softmax function's temperature is set to 0, making its output deterministic. This is equivalent to thresholding the hard mask probabilities at 0.5. 
Test-time interventions are performed by replacing the reconstructions of the hard-masked features with their true values, as shown in Figure <ref type="figure" target="#fig_3">1</ref>.</p></div>
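The three loss terms above can be sketched as plain functions. Note two assumptions of ours: we read 𝜔 𝑘max as the weight ω^{k_max} applied to the fully-intervened prediction loss, and the default α values are arbitrary placeholders, not the paper's settings:

```python
import numpy as np

def selection_loss(pi_soft, pi_hard):
    # L_S: l1 norm of the soft and hard mask probabilities (sparsity)
    return float(np.sum(pi_soft) + np.sum(pi_hard))

def reconstruction_loss(x_tilde, x_recon, selected):
    # L_R: mean squared error over the non-selected (masked) features only
    hidden = ~selected
    return float(np.mean((x_tilde[hidden] - x_recon[hidden]) ** 2))

def prediction_loss(loss_no_interv, loss_full_interv, omega, k_max):
    # L_P: task loss before and after intervening on all k_max dependent
    # features; omega > 1 weights the post-intervention term more heavily
    return loss_no_interv + (omega ** k_max) * loss_full_interv

def composite_loss(l_p, l_s, l_r, alpha_s=0.1, alpha_r=1.0):
    # L = L_P + alpha_S * L_S + alpha_R * L_R
    return l_p + alpha_s * l_s + alpha_r * l_r
```

For example, with ω = 2 and 𝑘 max = 2, a post-intervention mistake is penalised four times as heavily as a pre-intervention one, which is what pushes the model to actually exploit intervened features.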
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>Datasets and benchmark methods We consider various real-world datasets commonly referenced in the feature selection literature. These include image datasets (COIL20 and USPS), a voice audio dataset (Isolet) sourced from <ref type="bibr" target="#b28">[29]</ref>, a synthetic dataset (Madelon) from <ref type="bibr" target="#b29">[30]</ref>, genomic datasets (PBMC <ref type="bibr" target="#b30">[31]</ref> and Mice Protein <ref type="bibr" target="#b31">[32]</ref>), and a financial dataset (Finance) from <ref type="bibr" target="#b32">[33]</ref>.</p><p>Beyond predictive accuracy, we assess F-Act's capabilities in selecting important features and recommending which features to collect at test-time to enhance performance. To this end, we consider several feature selection methods, including LASSO <ref type="bibr" target="#b10">[11]</ref>, Random Forest <ref type="bibr" target="#b12">[13]</ref>, Concrete Autoencoders (CAE) <ref type="bibr" target="#b33">[34]</ref>, XGBoost <ref type="bibr" target="#b11">[12]</ref>, and SEFS <ref type="bibr" target="#b3">[4]</ref>. All methods except CAE rank the features by their importance, which allows us to evaluate F-Act's ability to recommend features for collection. We train each feature selection method on the prediction task, using a simple MLP as a baseline for comparison.</p><p>For missing data imputation, we evaluate three methods: Mean, Iterative Chained Equations (ICE) <ref type="bibr" target="#b5">[6]</ref>, and MissForest <ref type="bibr" target="#b6">[7]</ref>. Additionally, we assess the performance of all combinations of downstream models and imputation methods.</p><p>We train F-Act by minimising the loss ℒ. We pre-train the reconstruction module following <ref type="bibr" target="#b3">[4]</ref>, with tasks that include reconstructing input vectors and estimating gate vectors. 
Following this, we tune the intervention number 𝑘 to minimise the prediction loss on the validation dataset.</p><p>For further implementation details, please refer to Appendix A.</p><p>Predictive Accuracy Table <ref type="table" target="#tab_0">1</ref> illustrates the predictive accuracy of F-Act compared to other benchmark methods. F-Act demonstrates competitive performance, consistently ranking in the top three across datasets and outperforming all baselines in three cases. Overall, F-Act ranks the best across datasets. However, besides making predictions, F-Act offers two additional functionalities without any re-training: imputing missing data at test-time and recommending which feature values to collect. We next explore these capabilities.</p><p>Test-time Imputation First, we evaluate F-Act's imputation capabilities when features are missing completely at random (MCAR). To simulate this, we randomly remove features at test-time (without considering the potential to collect these feature values). At low levels of missing features, Figure <ref type="figure" target="#fig_1">2a</ref> shows that F-Act is generally outperformed by Random Forest and MLP, and at higher levels of missing values, it is outperformed by Lasso and MLP. These mixed results suggest F-Act achieves relatively average performance when one cannot utilise its learned feature ranking.</p><p>Second, we consider prioritising test-time feature collection based on the model's learned feature ranking. 
Figure <ref type="figure" target="#fig_1">2b</ref> shows that when this ranking is used, F-Act outperforms all other methods with only a few features collected.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Test-Time Interventions</head><p>As a feature selection method, F-Act uses a threshold to separate core from non-core features. Unlike standard approaches, F-Act can incorporate non-core features during inference. Figure <ref type="figure" target="#fig_2">3</ref> illustrates that F-Act's performance improves with test-time interventions, sometimes even surpassing the best-performing method on that dataset. For more results, please see Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper introduces F-Act, a method combining feature selection and missing data imputation to enable a model to operate when features are missing at test-time. More importantly, F-Act provides recommendations on which features one should prioritise collecting at test-time to improve the model's performance. Our empirical analysis shows F-Act performs competitively with or better than previous baselines in inference tasks with missing features when incorporating feature collection recommendations. Additionally, we show how F-Act can incorporate missing feature values at test-time through test-time interventions, improving performance without retraining and boosting F1 scores across datasets. This work highlights the benefit of designing methods that learn, in an end-to-end fashion, to adapt to different feature availability while providing feature collection recommendations.</p><formula xml:id="formula_4">3: for 𝑖 = 1 to 𝑛 mb do ◁ For each point in the mini-batch
4: (x 𝑖 , 𝑦 𝑖 ) ∼ (X, Y) ◁ Sample a data point
5: m hard 𝑖 ← Bernoulli(𝜋) ◁ Sample hard mask with probability 𝜋
6: x 𝑆 𝑖 ← m hard 𝑖 ⊙ x 𝑖 ◁ Apply hard mask to the data point</formula><formula xml:id="formula_5">10: 𝜃 ← 𝜃 − 𝜂∇ 𝜃 ∑︀ 𝑛 mb 𝑖=1 (︀ ℒ R (x 𝑆 𝑖 , x 𝑅 𝑖 ) + 𝛼 pre • CE(m ^hard 𝑖 , m hard 𝑖 ) )︀ ◁ Update parameters using gradient descent
11: until convergence</formula><p>We pre-train our model as per Appendix A.3 and train our model as per Algorithm 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5. Hyper-parameter Tuning</head><p>Random Forest. For the Random Forest model, we conducted a hyper-parameter sweep on the max_depth parameter. The values considered for max_depth were {3, 5, 7}. This tuning was performed to control the complexity of the individual trees in the forest, with the goal of balancing the bias-variance tradeoff.</p><p>Lasso. In our implementation of the Lasso model, we performed hyper-parameter tuning on two key parameters: l1_ratio and C. The l1_ratio was varied over {0, 0.25, 0.5, 0.75, 1}, allowing us to explore the impact of the ElasticNet mixing parameter, which adjusts the balance between L1 and L2 penalties. The C parameter, which controls the inverse of regularisation strength, was swept over {10, 100, 1000}, providing a wide range of regularisation effects.</p><p>XGBoost. For the XGBoost model, we focused our hyper-parameter sweep on eta (the learning rate) and max_depth. The eta values considered were {0.1, 0.3, 0.5}, providing a spectrum of learning rates to control the step size during optimisation. For max_depth, the values were {3, 6, 9}, allowing us to examine different depths for the trees to manage the model's complexity and prevent overfitting.</p><p>Neural Network Based Models. For the neural network-based models, which include MLP, SEFS, CAE, Supervised CAE, and F-Act, we conducted a hyper-parameter sweep. Key parameters included learning rate, lr, ({1e-3, 3e-4, 1e-4}), number of hidden layers ({1, 2, 4}), and dropout rates ({0, 0.2}). Additionally, for the Concrete Autoencoder models (CAE, Supervised CAE), we swept the neurons_ratio over {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} to explore various proportions of neurons in the encoder and decoder layers. 
This extensive tuning process was aimed at optimising each method's architecture and regularisation.</p><p>7: x ˜𝑖 ← m soft ⊙ x 𝑖 ◁ Apply soft mask to input data
8: x ˜𝑆𝑖 ← m hard ⊙ x ˜𝑖 ◁ Apply hard mask to soft-masked data
9: x ˜𝑅𝑖 ← 𝑓 𝑅 (x ˜𝑆𝑖 ; 𝜃 𝑅 ) ◁ Reconstruct the hard-masked features, apply interventions, and make prediction</p><p>In Figure <ref type="figure" target="#fig_1">2b</ref>, we observe the interesting phenomenon that, at low levels of missing data, ICE imputation enables the tree-based models to achieve considerably improved performance. Concurrently, ICE negatively affects the Neural Network and Lasso models. Here, we discuss that phenomenon. Note that, as per the structure of the experiment, the missing features at those levels are relatively lower-ranked features, which have less impact on predictions in the case of tree-based models. One possible explanation for the increase in performance of the tree-based models is that they were initially negatively affected by over-fitting to noise in the lower-ranked features. As a result, the models might benefit from the removal of these features in the test set. This then explains why replacing the lower-ranked features with expected values conditioned on the other observed features of that data point (as per the ICE imputation strategy <ref type="bibr" target="#b21">[22]</ref>) would reduce their noisy impact. To explain the poor performance of ICE with the Neural Network-based and Lasso models, we note that those models are known to be more sensitive to scale and domain shift <ref type="bibr" target="#b34">[35]</ref>. We also note that ICE is known to suffer from misspecification <ref type="bibr" target="#b35">[36]</ref>, where imputed values are "implausible", falling outside the domain.</p></div>
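To illustrate why ICE replaces missing entries with expected values conditioned on the observed features, here is a minimal ICE-style loop of our own, using linear conditional models fitted with least squares; the real ICE/MICE machinery is considerably richer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with one dependent column: x2 = 2*x0 - x1
X = rng.normal(size=(200, 2))
X = np.hstack([X, (2 * X[:, 0] - X[:, 1])[:, None]])
X_miss = X.copy()
miss_rows = rng.random(200) < 0.3
X_miss[miss_rows, 2] = np.nan

def ice_impute(X, n_iter=5):
    """Minimal ICE-style loop: initialise with column means, then
    repeatedly re-fit a linear model for each incomplete column on
    the other columns and overwrite the missing entries."""
    X = X.copy()
    nan_mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[nan_mask] = np.take(col_means, np.where(nan_mask)[1])  # mean init
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = nan_mask[:, j]
            if not rows.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.hstack([others, np.ones((len(X), 1))])    # add intercept
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            X[rows, j] = A[rows] @ coef                      # conditional expectation
    return X

X_hat = ice_impute(X_miss)
```

Because the dependent column here is an exact linear function of the observed ones, the imputed values recover the truth; with real data the same loop yields conditional-mean estimates, which is precisely why such values can fall outside plausible ranges when the conditional model is misspecified.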
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Optimal feature availability</head><p>The adaptive nature of our model enables us to provide a more finely tuned optimal feature selection recommendation than the standard "selected" vs "non-selected" features dichotomy. Rather than being derived from what is typically an arbitrarily set threshold, we are able to tune the number of selected features without re-training. By contrast, to implement feature selection with methods such as Lasso, it is standard to re-train the model with the reduced feature set. In this section, we hypothesise that this functionality will enable improved performance, due to the ability to more finely tune the number of selected features. The "optimal feature selection" is found through post-training hyperparameter tuning of the number of interventions, 𝑘. That is, we evaluate the model at varying degrees of test-time interventions on the validation set, and then set this as the number of features used by the model. In Table <ref type="table" target="#tab_4">3</ref>, we present the results of this approach. In the table, we compare F-Act, which uses post-training hyperparameter tuning of 𝑘, to F-Act "selected only", which uses only the selected features. We find that in most datasets, this enables an improvement over the "selected only" variant of the model. Exceptions include the finance and mice protein datasets, where F-Act "selected only" outperforms F-Act by 0.005, which is within the standard deviation of our model's F1 score for those datasets. However, on other datasets, such as PBMC, the improvement of F-Act over F-Act "selected only" is notably greater than twice the standard deviation. 
Overall, we find that the difference in average F1 score falls within the standard deviation of the model, indicating that the potential gains from this mechanism are limited. However, the substantial gains on the PBMC dataset, as well as the small computational cost of its implementation, indicate that the mechanism is worth exploring when deploying F-Act.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a) Missing Completely at Random (MCAR) (b) Collecting features by model's feature ranking</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Performance varies with the number of missing features at test-time; error bars represent standard deviation. (a) When features are missing at random, several methods outperform F-Act. (b) F-Act outperforms other methods when the features selected for collection at test-time are based on the predictor's learned feature ranking. F-Act outperforms all baseline methods when many features are missing. However, with fewer missing features, F-Act can be outperformed by XGBoost combined with ICE. This indicates that F-Act is especially effective when feature collection can be prioritised.</figDesc><graphic coords="6,89.29,84.19,200.01,166.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Test-time interventions boost performance. The x-axis represents the number of interventions. The dotted line marks the peak performance achieved by baselines trained with the full feature set, whereas F-Act operates with only a subset. The results demonstrate that F-Act's performance improves when values for the recommended features are provided at test-time. In some cases, this enhancement allows F-Act, without retraining, to surpass baselines that need the full feature set at test-time.</figDesc><graphic coords="7,130.96,84.19,333.34,166.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Algorithm 1</head><label>1</label><figDesc>Pre-training F-Act Require: Dataset (X, Y), mini-batch size 𝑛 mb , learning rate 𝜂, Loss coefficient 𝛼 pre , mask probability 𝜋 Ensure: Trained model parameters 𝜃 𝑅 1: Initialise parameters 𝜃 𝑅 , 𝜃 pre 2: repeat ◁ Begin pre-training loop 3:</figDesc></figure>
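The masking step of the pre-training loop above can be sketched as follows. This is an illustrative sketch only: the function name `mask_features`, the zero-fill convention for hidden features, and the per-feature Bernoulli draw with probability 𝜋 are our assumptions, not the paper's exact implementation.

```python
import random

def mask_features(x, pi, rng):
    # Bernoulli mask: feature j is hidden with probability pi.
    # m[j] = 1.0 means the feature is observed; 0.0 means it is
    # hidden and zero-filled before the reconstruction step.
    m = [0.0 if rng.random() < pi else 1.0 for _ in x]
    x_masked = [v * keep for v, keep in zip(x, m)]
    return x_masked, m

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
x_masked, m = mask_features(x, 0.5, rng)
```

The reconstruction network 𝑓𝑅 is then trained to recover `x` from `x_masked`, while 𝑓pre is trained to predict `m` itself.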
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Test F1 scores (%) (class-weighted), presented as the mean ± standard deviation over three seeds. To aggregate the results, we compute the average rank of each method across datasets, where a lower average rank indicates superior accuracy. F-Act achieves competitive accuracy and attains the best overall rank among the benchmark methods.</figDesc><table><row><cell>Model</cell><cell>COIL20</cell><cell>Isolet</cell><cell>PBMC</cell><cell>USPS</cell><cell>Finance</cell><cell>Madelon</cell><cell>Mice Protein</cell><cell>Avg. Rank</cell></row><row><cell>Lasso</cell><cell cols="7">98.24 ± 0.00 94.58 ± 0.01 89.25 ± 0.14 93.36 ± 0.03 59.78 ± 0.00 51.53 ± 0.00 95.24 ± 2.29</cell><cell>4.29</cell></row><row><cell>Rand. Forest</cell><cell cols="7">96.76 ± 0.18 90.09 ± 0.45 88.66 ± 0.37 93.36 ± 0.15 61.95 ± 0.47 67.19 ± 0.26 96.98 ± 1.82</cell><cell>4.07</cell></row><row><cell>XGBoost</cell><cell cols="7">98.61 ± 0.00 88.75 ± 0.00 89.42 ± 0.00 97.37 ± 0.00 58.83 ± 0.00 80.96 ± 0.00 98.15 ± 0.81</cell><cell>3.07</cell></row><row><cell>SEFS</cell><cell cols="7">94.97 ± 1.40 88.61 ± 2.08 83.17 ± 1.27 92.54 ± 0.75 59.93 ± 0.46 65.23 ± 1.69 85.08 ± 4.59</cell><cell>5.71</cell></row><row><cell>CAE</cell><cell cols="7">97.04 ± 0.87 80.14 ± 1.28 68.07 ± 5.34 90.47 ± 0.83 59.26 ± 1.20 70.19 ± 1.83 85.35 ± 5.37</cell><cell>5.86</cell></row><row><cell>Sup. CAE</cell><cell cols="7">6.46 ± 4.45 3.68 ± 0.79 85.37 ± 0.01 20.90 ± 2.56 54.44 ± 1.80 61.84 ± 0.30 17.24 ± 6.78</cell><cell>7.43</cell></row><row><cell>MLP</cell><cell cols="7">98.83 ± 0.53 93.48 ± 1.29 89.42 ± 0.48 96.78 ± 0.15 57.02 ± 3.96 57.18 ± 0.89 98.30 ± 0.27</cell><cell>3.36</cell></row><row><cell>F-Act (Ours)</cell><cell cols="7">98.84 ± 0.19 92.86 ± 1.18 89.87 ± 0.40 95.95 ± 0.31 59.81 ± 1.90 72.90 ± 2.33 98.46 ± 1.07</cell><cell>2.21</cell></row></table></figure>
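The rank aggregation described in the caption of Table 1 can be sketched as follows. `average_ranks` is a hypothetical helper, and we assume tie-free rankings for simplicity.

```python
def average_ranks(scores_by_dataset):
    # Rank methods within each dataset by score (rank 1 = best F1),
    # then average each method's rank across all datasets.
    per_method = {}
    for scores in scores_by_dataset:  # each entry: {method: F1 score}
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            per_method.setdefault(method, []).append(rank)
    return {m: sum(r) / len(r) for m, r in per_method.items()}

ranks = average_ranks([
    {"A": 0.98, "B": 0.95, "C": 0.90},   # A=1, B=2, C=3
    {"A": 0.70, "B": 0.80, "C": 0.60},   # B=1, A=2, C=3
])
# → {"A": 1.5, "B": 1.5, "C": 3.0}
```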
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>𝑅 𝑖 ← 𝑓 𝑅 (x 𝑆 𝑖 ; 𝜃 𝑅 ) ◁ Reconstruct features from masked data 8: m ^hard 𝑖 ← 𝑓 pre (x 𝑆 𝑖 ; 𝜃 pre ) ◁ Predict mask using the pre-training function</figDesc><table><row><cell>9:</cell><cell>end for</cell></row><row><cell>10:</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Algorithm 2</head><label>2</label><figDesc>Training F-Act Require: Dataset (X, Y), mini-batch size 𝑛 mb , loss coefficients (𝛼 𝑆 , 𝛼 𝑅 ), Gumbel-Softmax temperature parameter 𝜏 , maximum intervention 𝑘 max , learning rate 𝜂 Ensure: Trained model parameters (𝜃 𝑚 soft , 𝜃 𝑚 hard , 𝜃 𝑅 , 𝜃 𝑃 ) 1: Initialise (𝜃 𝑚 soft , 𝜃 𝑚 hard , 𝜃 𝑅 , 𝜃 𝑃 )</figDesc><table><row><cell></cell><cell></cell><cell>◁ Initialise parameter weights randomly</cell></row><row><cell cols="2">2: repeat</cell><cell></cell></row><row><cell>3:</cell><cell>for 𝑖 = 1 to 𝑛 do</cell><cell>◁ For each sample in the training set</cell></row><row><cell>4:</cell><cell>(x 𝑖 , 𝑦 𝑖 ) ∼ (X, Y)</cell><cell>◁ Sample a data point</cell></row><row><cell></cell><cell>Mask the data</cell><cell></cell></row><row><cell>5:</cell><cell>m soft ← Sigmoid(𝜃 𝑚 soft )</cell><cell>◁ Compute soft mask</cell></row><row><cell>6:</cell><cell>m hard ← GumbelSoftmax(𝜃 𝑚 hard , 𝜏 )</cell><cell>◁ Compute hard mask</cell></row><row><cell>7:</cell><cell></cell><cell></cell></row></table></figure>
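The soft/hard gate computation in Algorithm 2 can be sketched with the binary Gumbel-Softmax (concrete) relaxation. This is a sketch under our assumptions; the paper's exact parameterisation of the gates may differ.

```python
import math
import random

def gumbel_sigmoid(theta, tau, rng):
    # Binary Gumbel-Softmax: perturb the gate logit with the difference
    # of two Gumbel noises and squash. As tau -> 0 the samples approach
    # hard 0/1 gates while remaining differentiable in theta.
    g1 = -math.log(-math.log(rng.random()))
    g2 = -math.log(-math.log(rng.random()))
    return 1.0 / (1.0 + math.exp(-(theta + g1 - g2) / tau))

rng = random.Random(0)
theta = [2.0, -2.0, 0.0]                              # per-feature gate logits
m_soft = [1.0 / (1.0 + math.exp(-t)) for t in theta]  # m_soft = Sigmoid(theta)
m_hard = [gumbel_sigmoid(t, 0.5, rng) for t in theta] # relaxed "hard" mask
```

With a low temperature the sampled gates concentrate near 0 and 1, which is what lets the hard mask select a discrete feature subset while still passing gradients to the gate logits.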
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Weighted-average F1 scores. We compare F-Act, which uses an "optimal" feature selection found through post-training hyperparameter tuning of the number of interventions, against F-Act "selected only", which uses the feature selection obtained by thresholding the feature-selection probabilities at 50%.</figDesc><table><row><cell>Partition</cell><cell>Validation</cell><cell>Validation</cell><cell>Test</cell><cell>Test</cell></row><row><cell>Model</cell><cell>F-Act (selected only)</cell><cell>F-Act</cell><cell cols="2">F-Act (selected only) F-Act</cell></row><row><cell>COIL20</cell><cell>0.9929</cell><cell>0.9953</cell><cell>0.9860</cell><cell>0.9884</cell></row><row><cell>Isolet</cell><cell>0.9264</cell><cell>0.9430</cell><cell>0.9197</cell><cell>0.9286</cell></row><row><cell>PBMC</cell><cell>0.8461</cell><cell>0.8793</cell><cell>0.8317</cell><cell>0.8987</cell></row><row><cell>USPS</cell><cell>0.9681</cell><cell>0.9681</cell><cell>0.9683</cell><cell>0.9683</cell></row><row><cell>Finance</cell><cell>0.5992</cell><cell>0.6027</cell><cell>0.6029</cell><cell>0.5981</cell></row><row><cell>Madelon</cell><cell>0.7287</cell><cell>0.7415</cell><cell>0.7225</cell><cell>0.7290</cell></row><row><cell>Mice Protein</cell><cell>0.9815</cell><cell>0.9908</cell><cell>0.9877</cell><cell>0.9846</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Reproducibility</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Datasets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Reproducibility</head><p>Our code is made available at https://github.com/evanrex/feature-wise-active-adaptation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3. Training Protocol</head><p>We present the training algorithms for our approach in Algorithms 1 and 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4. Training and Evaluation Methodology</head><p>We divided each dataset into training, validation, and test subsets using a 60:20:20 split. We use the validation set to select the model's hyperparameters, and we evaluate and report the class-weighted F1 score on the test set. The results are averaged across three seeds during both validation and testing.</p></div>			</div>
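The 60:20:20 split described above can be sketched as follows; `split_indices` and the seeded-shuffle scheme are illustrative assumptions, not the exact code used in the experiments.

```python
import random

def split_indices(n, seed, fracs=(0.6, 0.2, 0.2)):
    # Shuffle sample indices with a fixed seed, then cut them into
    # train/validation/test partitions of 60%, 20%, and 20%.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(100, seed=0)
```

Repeating this with several seeds and averaging the resulting test F1 scores gives the mean ± standard deviation figures reported in the tables.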
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An evaluation of machine learning algorithms for missing values imputation</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H A</forename><surname>Kohbalan Moorthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">W H</forename><surname>Mohd Arfian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Mohd</forename><surname>Ismail</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamad</forename><surname>Saberi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Innovative Technology and Exploring Engineering</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="415" to="420" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Cost or price of sequencing? implications for economic evaluations in genomic medicine</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">D</forename><surname>Grosse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Gudgeon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Genetics in Medicine</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="1833" to="1835" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Unsupervised and supervised feature selection for incomplete data via l2, 1-norm and reconstruction error minimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">8752</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Self-supervision enhanced feature selection with correlated gates</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Imrie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Der Schaar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">HyperImpute: Generalized iterative imputation with automatic model selection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jarrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Cebere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Curth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Der Schaar</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v162/jarrett22a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 39th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Jegelka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Song</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Szepesvari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<meeting>the 39th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page" from="9916" to="9937" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Multiple imputation of discrete and continuous data by fully conditional specification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Van Buuren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Statistical Methods in Medical Research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="219" to="242" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">MissForest: non-parametric missing value imputation for mixed-type data</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Stekhoven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bühlmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="112" to="118" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Beyond impute-then-regress: Adapting prediction to missing data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bertsimas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Delarue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pauphilet</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.03158</idno>
		<ptr target="https://arxiv.org/abs/2104.03158" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">NeuMiss networks: differentiable programming for supervised learning with missing values</title>
		<author>
			<persName><forename type="first">M</forename><surname>Le Morvan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Josse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Moreau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Scornet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="5980" to="5990" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A review of feature selection methods on synthetic data</title>
		<author>
			<persName><forename type="first">V</forename><surname>Bolón-Canedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sánchez-Maroño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alonso-Betanzos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and information systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="483" to="519" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Regression shrinkage and selection via the lasso</title>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Royal Statistical Society, Series B</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="267" to="288" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">XGBoost: A scalable tree boosting system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
		<idno type="DOI">10.1145/2939672.2939785</idno>
		<ptr target="https://doi.org/10.1145/2939672.2939785" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Krishnapuram</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Smola</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Aggarwal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Shen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Rastogi</surname></persName>
		</editor>
		<meeting>the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">August 13-17, 2016. 2016</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Learning to maximize mutual information for dynamic feature selection</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">C</forename><surname>Covert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/covert23a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>the 40th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="6424" to="6447" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Sefidian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Daneshpour</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.eswa.2018.07.057</idno>
		<ptr target="https://doi.org/10.1016/j.eswa.2018.07.057" />
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="page" from="68" to="94" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Feature selection with missing data using mutual information estimators</title>
		<author>
			<persName><forename type="first">G</forename><surname>Doquire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Verleysen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">90</biblScope>
			<biblScope unit="page" from="3" to="11" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Combination of KNN-based feature selection and KNN-based missing-value imputation of microarray data</title>
		<author>
			<persName><forename type="first">P</forename><surname>Meesad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hengpraprohm</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICICIC.2008.635</idno>
	</analytic>
	<monogr>
		<title level="m">2008 3rd International Conference on Innovative Computing Information and Control</title>
				<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="341" to="341" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Active feature-value acquisition</title>
		<author>
			<persName><forename type="first">M</forename><surname>Saar-Tsechansky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Melville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Provost</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Management Science</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="664" to="684" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Active feature acquisition with generative surrogate models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Oliva</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v139/li21p.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Meila</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 38th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="6450" to="6459" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Joint active feature acquisition and classification with variable-size set encoding</title>
		<author>
			<persName><forename type="first">H</forename><surname>Shim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2018/hash/e5841df2166dd424a57127423d276bbe-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Grauman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Cesa-Bianchi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>NeurIPS; Montréal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-12-03">2018. December 3-8, 2018. 2018</date>
			<biblScope unit="page" from="1375" to="1385" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Active feature acquisition with generative surrogate models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Oliva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="6450" to="6459" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Multiple imputation by chained equations: what is it and how does it work?</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Azur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Stuart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Frangakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Leaf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of methods in psychiatric research</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="40" to="49" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Concept bottleneck models</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Koh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">S</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mussmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pierson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v119/koh20a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th International Conference on Machine Learning, ICML 2020</title>
				<meeting>the 37th International Conference on Machine Learning, ICML 2020<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020-07">July 2020. 2020</date>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="5338" to="5348" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Concept embedding models: Beyond the accuracy-explainability trade-off</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Espinosa</forename><surname>Zarlenga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barbiero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ciravegna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Marra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Giannini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diligenti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Precioso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Melacci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Weller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="21400" to="21413" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Learning to receive help: Intervention-aware concept embedding models</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Espinosa</forename><surname>Zarlenga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Dvijotham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Weller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jamnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Beyond concept bottleneck models: How to make black boxes intervenable?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Marcinkevičs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Laguna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vandenhirtz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Vogt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.13544</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Poole</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.01144</idno>
		<title level="m">Categorical reparameterization with Gumbel-Softmax</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Weight predictor network with feature selection for small sample tabular biomedical data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Margeloiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Simidjievski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liò</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jamnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="9081" to="9089" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Feature selection: A data perspective</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Morstatter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Trevino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page">94</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Madelon, UCI Machine Learning Repository</title>
		<author>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</author>
		<idno type="DOI">10.24432/C5602H</idno>
		<ptr target="https://doi.org/10.24432/C5602H" />
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Joint probabilistic modeling of paired transcriptome and proteome measurements in single cells</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gayoso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Steier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Regier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">L</forename><surname>Nazor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Streets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yosef</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">bioRxiv</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Mice protein expression, UCI Machine Learning Repository</title>
		<author>
			<persName><forename type="first">C</forename><surname>Higuera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gardiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cios</surname></persName>
		</author>
		<idno type="DOI">10.24432/C50S3Z</idno>
		<ptr target="https://doi.org/10.24432/C50S3Z" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Carbone</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/cnic92/200-financial-indicators-of-us-stocks-20142018" />
		<title level="m">200+ financial indicators of US stocks (2014-2018)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Concrete autoencoders: Differentiable feature selection and reconstruction</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Balın</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v97/balin19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="444" to="453" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Deep learning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<idno type="DOI">10.1145/2623330.2630809</idno>
		<ptr target="https://doi.org/10.1145/2623330.2630809" />
	</analytic>
	<monogr>
		<title level="m">The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD &apos;14</title>
				<editor>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Macskassy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Perlich</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Ghani</surname></persName>
		</editor>
		<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2014">August 24-27, 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Multiple imputation using chained equations: issues and guidance for practice</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">R</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Royston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Wood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Statistics in Medicine</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="377" to="399" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
