<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Selection Using Quantum Annealing: A Mutual Information Based QUBO Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Talha Shaikh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Hamza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Syed Bilal Ali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Rafi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumaiyah Zahid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Computer and Emerging Sciences -FAST</institution>
          ,
          <addr-line>St-4, Sector 17-D, N-5, Karachi</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a quantum-inspired feature selection approach to Learning to Rank (LTR) on the MQ2007 collection of Quantum CLEF 2025 Task 1A. The goal is to find a minimal yet informative subset of features that maximizes ranking performance with respect to NDCG@10. We cast feature selection as a Binary Quadratic Model (BQM), where coefficients are derived from Mutual Information (MI) and Conditional Mutual Information (CMI). The resulting Quadratic Unconstrained Binary Optimization (QUBO) problems are optimized using Quantum Annealing (QA) on quantum hardware and Simulated Annealing (SA) on traditional processors. Experimental results show that QA attains the best NDCG@10 of 0.44, outperforming SA while also substantially reducing computation time.</p>
      </abstract>
      <kwd-group>
        <kwd>Quantum Annealing</kwd>
        <kwd>Feature Selection</kwd>
        <kwd>QUBO</kwd>
        <kwd>Simulated Annealing</kwd>
        <kwd>Quantum Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Current research suggests that QA can be superior to traditional methods for specific combinatorial
optimization problems, particularly if the problem is amenable to the targeted quantum hardware [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Experimental efficiency in QA relies significantly on the problem definition and on the quality of the mapping to
quantum hardware, with encouraging initial results seen in industrial applications [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Herein, we investigate the application of both QA and SA to feature subset selection on the MQ2007
LTR dataset. Our approach maps MI- and CMI-based relevance and redundancy estimates into a QUBO
formulation and uses quantum and simulated annealing to identify highly informative and diverse
subsets of features. We then evaluate the effect of the selected subset using the NDCG@10 metric, together
with visualization and comparative analysis, to study feature importance and interrelationships. This
dual investigation is intended to better illuminate the trade-offs between quantum and classical
annealing in practical feature selection pipelines.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <sec id="sec-2-1">
        <title>2.1. Feature Selection and Its Significance</title>
        <p>Feature selection involves reducing dimensionality by identifying the most impactful features from a
larger set. Beyond boosting model accuracy, its advantages include:
• Avoiding the Curse of Dimensionality: As the dimensionality grows, the feature space
becomes exponentially sparser, and models find it harder to generalize. Feature
selection helps focus learning on the most relevant variables.
• Lower Computational Cost: Reducing the number of features shrinks input
vector sizes, which in turn accelerates model training, decreases inference time, and
decreases memory consumption; this is especially beneficial for large ranking tasks.
• Improved Generalization: Models trained on high-dimensional data tend to overfit,
particularly when irrelevant or redundant features are present. Feature selection improves generalization
by concentrating on the essential predictive features.
• Improved Model Explainability: A streamlined, carefully curated set of features makes the
model's decision process and behavior easier to understand, which is critical in ranking applications
where explainability is important.
• Noise Reduction: Removing features that are irrelevant or only weakly related eliminates noise
in learning and yields stronger, more stable models.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. MI-Based Formulation of Feature Selection</title>
        <p>The objective is to choose features that hold maximal information about the target label, quantified
using Mutual Information (MI) and Conditional Mutual Information (CMI). MI measures how
much knowing a feature reduces uncertainty about the label. CMI extends this to measure additional
information gained from one feature conditional on another.</p>
        <p>Given $F = \{f_1, f_2, \ldots, f_n\}$ and label $Y$, the goal is to maximize
$$\sum_{f_i \in S} I(f_i; Y) \;+\; \sum_{\substack{f_i, f_j \in S \\ i \neq j}} I(f_i; Y \mid f_j)$$
where $S \subseteq F$ is the selected feature subset.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. QUBO Encoding via MI and CMI</title>
        <p>The above MI-based objective can be converted into a QUBO as follows:
• Linear terms: $Q_{ii} \leftarrow -I(f_i; Y)$
• Quadratic terms: $Q_{ij} \leftarrow -I(f_i; Y \mid f_j)$
This yields the QUBO:
$$\min_{\mathbf{x} \in \{0,1\}^n} \mathbf{x}^{\top} Q \, \mathbf{x}$$
where $\mathbf{x}$ is the binary vector indicating selected features.</p>
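        <p>As a minimal sketch, the $Q$ matrix can be assembled directly from precomputed MI and CMI scores; the array names and the function below are illustrative, not part of the original pipeline:</p>
        <preformat>
import numpy as np

def build_qubo(mi, cmi):
    """Assemble the upper-triangular QUBO matrix Q from MI/CMI scores.

    mi  : (n,) array with mi[i] = I(f_i; Y)
    cmi : (n, n) array with cmi[i, j] = I(f_i; Y | f_j)
    """
    n = len(mi)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -mi[i]              # linear term rewards relevance
        for j in range(i + 1, n):
            Q[i, j] = -cmi[i, j]      # quadratic term rewards complementary pairs
    return Q
        </preformat>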
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Intuition Behind Quantum Annealing</title>
        <p>Traditional feature selection algorithms must search all $2^n$ subsets of $n$ features, which is intractable even for
moderately sized $n$. Quantum Annealing, through the use of superposition, can in certain instances explore
this exponentially large search space far more efficiently. It is important to note,
however, that although a quantum system can exist in a superposition of all possible states during the anneal,
it collapses to a single result when measured. Therefore, constructing good QUBOs that steer the
system toward high-quality optima is essential.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Feature Selection Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Problem Formulation</title>
        <p>The MQ2007 dataset consists of query-document feature vectors $F = \{f_1, f_2, \ldots, f_n\}$ with
corresponding relevance labels $Y$. The objective is to identify a subset $S \subseteq F$ of fixed size $k$ that collectively
maximizes feature relevance to the target while minimizing redundancy among the selected features,
thereby improving overall ranking performance. This problem is cast as a Binary Quadratic Model
(BQM), which is then minimized by the quantum annealer.</p>
        <p>Formally, given $F = \{f_1, f_2, \ldots, f_n\}$ and label $Y$, the goal is to minimize the following objective
function, which maps directly onto the BQM formulation:
$$\min_{S} \left[ \sum_{f_i \in S} -I(f_i; Y) \;+\; \sum_{\substack{f_i, f_j \in S \\ i &lt; j}} -I(f_i; Y \mid f_j) \right]$$</p>
        <p>Here, $-I(f_i; Y)$ encourages the inclusion of features with high relevance to $Y$: selecting such a
feature lowers the objective by its MI score. The term $-I(f_i; Y \mid f_j)$ penalizes
redundancy among the selected features: redundant pairs have low CMI, so they lower the objective
less than complementary pairs and are therefore disfavored. The fixed cardinality $|S| = k$ is imposed as a soft
constraint, enforced via a penalty term in the BQM framework.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Mutual and Conditional Mutual Information Estimation</title>
        <p>The information-theoretic quantities $I(f_i; Y)$ and $I(f_i; Y \mid f_j)$ are computed using histogram-based
entropy estimation. First, joint probability distributions are obtained by discretizing the feature vectors
into bins. Entropy is calculated using the Shannon formulation:
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$</p>
        <p>For a feature $f_i$ and target $Y$, the mutual information is computed as:
$$I(f_i; Y) = H(f_i) + H(Y) - H(f_i, Y)$$
Conditional mutual information for a feature pair $(f_i, f_j)$ is defined as:
$$I(f_i; Y \mid f_j) = H(f_i, f_j) + H(Y, f_j) - H(f_i, Y, f_j) - H(f_j)$$</p>
        <p>In practice, joint histograms are estimated using np.histogramdd, and entropies are computed
directly from normalized joint probabilities. A bin size of 10 is used for 2D MI estimation, while a
smaller bin size of 6 is used for 3D CMI to mitigate sparsity.</p>
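        <p>A minimal sketch of this estimator, assuming each feature and the label are 1-D NumPy arrays (the function names are illustrative):</p>
        <preformat>
import numpy as np

def entropy(*columns, bins):
    """Shannon entropy (bits) of the joint distribution of the given columns."""
    hist, _ = np.histogramdd(np.column_stack(columns), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before taking the log
    return -np.sum(p * np.log2(p))

def mutual_information(f, y, bins=10):
    # I(f; Y) = H(f) + H(Y) - H(f, Y), with 10 bins for 2D estimation
    return entropy(f, bins=bins) + entropy(y, bins=bins) - entropy(f, y, bins=bins)

def conditional_mi(fi, fj, y, bins=6):
    # I(f_i; Y | f_j) = H(f_i, f_j) + H(Y, f_j) - H(f_i, Y, f_j) - H(f_j),
    # with 6 bins for 3D estimation to mitigate sparsity
    return (entropy(fi, fj, bins=bins) + entropy(y, fj, bins=bins)
            - entropy(fi, y, fj, bins=bins) - entropy(fj, bins=bins))
        </preformat>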
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Binary Quadratic Model Construction</title>
        <p>The optimization is mapped onto a Binary Quadratic Model (BQM), where each binary variable $x_i \in \{0, 1\}$
indicates whether feature $f_i$ is selected. The energy function is defined as:
$$E(\mathbf{x}) = \sum_{i} a_i x_i + \sum_{i &lt; j} b_{ij} x_i x_j + \lambda \left( \sum_{i} x_i - k \right)^{2}$$
where:
• $a_i = -I(f_i; Y)$ encourages inclusion of features with high relevance to $Y$.
• $b_{ij} = -I(f_i; Y \mid f_j)$ penalizes redundancy among selected features, since redundant pairs
have low CMI and therefore lower the energy less than complementary pairs.
• The last term is a soft cardinality constraint with strength $\lambda = 10.0$.</p>
        <p>The BQM is constructed by iterating over all features and feature pairs to compute the MI and CMI values,
which are also stored for downstream analytics. The constraint enforcing exactly $k$ selected features is generated
separately and merged into the BQM, as sketched below.</p>
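        <p>A minimal sketch of this construction using dimod, assuming mi and cmi hold the precomputed scores (names are illustrative):</p>
        <preformat>
import dimod

def build_bqm(mi, cmi, k=10, strength=10.0):
    """Assemble the BQM with MI linear terms, CMI quadratic terms,
    and a soft constraint selecting exactly k features."""
    n = len(mi)
    bqm = dimod.BinaryQuadraticModel('BINARY')
    for i in range(n):
        bqm.add_variable(i, -mi[i])                # a_i = -I(f_i; Y)
        for j in range(i + 1, n):
            bqm.add_interaction(i, j, -cmi[i][j])  # b_ij = -I(f_i; Y | f_j)
    # Soft cardinality penalty: strength * (sum_i x_i - k)^2
    bqm.update(dimod.generators.combinations(n, k, strength=strength))
    return bqm
        </preformat>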
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Simulated Annealing</title>
        <p>Simulated Annealing (SA) is run using the neal sampler. Simulated annealing simulates
temperature variations, traversing the Binary Quadratic Model's (BQM) energy landscape by
stochastically flipping binary variables according to an imposed temperature schedule. A single run
comprises 2,000 reads, and the lowest-energy sample is taken as the best subset of features.</p>
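        <p>For reference, a minimal sketch of the SA call, assuming bqm is the model constructed above:</p>
        <preformat>
import neal

sampler = neal.SimulatedAnnealingSampler()
sampleset = sampler.sample(bqm, num_reads=2000)   # 2,000 reads per run
best = sampleset.first.sample                     # lowest-energy sample
selected = [v for v, bit in best.items() if bit == 1]
        </preformat>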
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Quantum Annealing</title>
        <p>Quantum Annealing (QA) is run on the D-Wave quantum processing unit (QPU) through the
EmbeddingComposite wrapper, which minor-embeds the problem onto the QPU topology. The quantum sampler attempts to find the global optimum of the
Binary Quadratic Model (BQM) by evolving an initial quantum superposition according to an energy
Hamiltonian derived from the problem parameters.</p>
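        <p>A minimal sketch of the QPU call under the same assumptions; the num_reads value here is illustrative, as the text does not specify it:</p>
        <preformat>
from dwave.system import DWaveSampler, EmbeddingComposite

# EmbeddingComposite handles minor-embedding onto the QPU topology
qpu = EmbeddingComposite(DWaveSampler())
sampleset = qpu.sample(bqm, num_reads=1000)  # read count assumed for illustration
best = sampleset.first.sample
        </preformat>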
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Feature Extraction and Submission Formatting</title>
        <p>Once the annealing process concludes, the resulting samples are processed to obtain the selected feature
indices. Features whose binary variables equal 1 are retained, and their original 1-based indices
are saved. A submission file is generated for each annealing method, containing:
• A sorted list of selected feature indices.
• A trailing line containing the problem identifier (submission ID) for traceability.</p>
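        <p>A sketch of this formatting step; the one-index-per-line layout is an assumption beyond the stated "sorted list plus trailing identifier" structure:</p>
        <preformat>
def write_submission(best_sample, problem_id, path):
    """Write sorted 1-based feature indices, then the problem ID line."""
    indices = sorted(i + 1 for i, bit in best_sample.items() if bit == 1)
    with open(path, 'w') as f:
        for idx in indices:
            f.write(f"{idx}\n")
        f.write(f"{problem_id}\n")   # trailing line for traceability
        </preformat>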
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Summary</title>
        <p>The described methodology enables effective feature selection through hybrid classical-quantum
optimisation. The integration of MI-based relevance, CMI-based redundancy management, and a
cardinality-enforcing BQM formulation enables fine-grained control over the feature subset. The joint use of SA and QA
provides resilience against hardware unavailability at submission time and produces competitive,
analytically interpretable feature selections.</p>
      </sec>
      <sec id="sec-3-8">
        <title>3.8. Experimental Setup</title>
        <p>Data Preparation: The MQ2007 dataset consists of 46 numerical features per query-document pair,
along with relevance labels. The data preprocessing steps include:
• Min-max normalization: Each feature is scaled to the [0, 1] range independently to ensure
uniform contribution to the mutual information calculations and compatibility with the binary nature
of the BQM formulation.
• Discretization: To compute entropy-based metrics, all continuous features are discretized into a
fixed number of bins (default is 10). This binning enables the estimation of the empirical probability
distributions required for MI and CMI computations (see the sketch after this list).
• MI and CMI estimation: Mutual Information (MI) is computed between each feature and the
target label (relevance). Conditional Mutual Information (CMI) is also estimated for each feature
pair conditioned on the label, which helps capture joint relevance. These scores later form
the coefficients of the BQM.</p>
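        <p>A minimal sketch of these two preprocessing steps, assuming X is the (samples × 46) feature matrix (names are illustrative):</p>
        <preformat>
import numpy as np

def preprocess(X, bins=10):
    """Min-max normalize each feature column, then discretize into bins."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)           # guard constant columns
    X_norm = (X - lo) / span                         # each column now in [0, 1]
    X_binned = np.minimum((X_norm * bins).astype(int), bins - 1)
    return X_norm, X_binned
        </preformat>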
        <p>BQM Construction: A Binary Quadratic Model (BQM) is constructed where:
• Linear terms: The linear coefficients are the negative Mutual Information scores, i.e.,
higher MI is encouraged by minimizing the BQM objective.
• Quadratic terms: The quadratic coefficients are based on the negative Conditional Mutual
Information between feature pairs. This penalizes the selection of redundant features.
• Soft constraint on feature count: To enforce approximately $k = 10$ selected features, an
auxiliary quadratic penalty term is added to the BQM. This term penalizes any solution where
the sum of the selected binary variables is not equal to $k$, using a high penalty coefficient (e.g., 5.0).</p>
      </sec>
      <sec id="sec-3-9">
        <title>3.9. Evaluation Pipeline</title>
        <p>Once the BQM is constructed, feature selection and evaluation proceed as follows:
• Sampling: The BQM is solved with two approaches: Quantum Annealing on D-Wave hardware and Simulated
Annealing via the neal sampler. For both samplers, the lowest-energy solution is taken,
yielding a binary vector encoding the chosen features.
• Feature subset selection: The indices of the active bits (1s) in the lowest-energy
sample are extracted. These are used to build a compact feature matrix by selecting the
corresponding columns of the original data.
• Model training: A RankSVM model is trained on the reduced training set. Default
hyperparameters are used in the RankSVM configuration, which is trained using query-specific splits of
the training data.
• Evaluation: The trained model is assessed with NDCG@10 (Normalized Discounted Cumulative Gain at position
10) on the provided test set, in accordance with the Quantum CLEF Task 1A [8, 9]
evaluation protocol. This measure assesses the ranking quality of the top
10 documents retrieved per query (a reference sketch follows this list).</p>
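        <p>A minimal sketch of per-query NDCG@10 under the common exponential-gain formulation; the track's official evaluator may use a different gain or discount variant:</p>
        <preformat>
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query, given relevance labels in ranked order."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = np.sum((2.0 ** rel[:k] - 1.0) * discounts)
    ideal = np.sort(rel)[::-1][:k]                 # best possible ordering
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0
        </preformat>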
      </sec>
      <sec id="sec-3-10">
        <title>3.10. Submission and Visualization</title>
        <p>Final submission files include the 1-based indices of selected features and the associated problem ID
(from D-Wave or fallback SA runs). The analytics module produces the following visualizations:
• Top-20 Mutual Information Bar Chart: Displays the MI scores of the most relevant features.
• Correlation Heatmap: Plots Pearson correlation between features selected by SA and QA.
• Feature Selection Comparison: Venn-style bar chart showing overlap and divergence in
features selected by QA vs. SA.</p>
        <p>These diagnostics enable qualitative assessment of feature quality and algorithmic behavior under
quantum and classical regimes.</p>
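        <p>As an illustration, a sketch that reproduces the first of these plots (the top-20 MI bar chart) from the stored MI scores; names are illustrative:</p>
        <preformat>
import numpy as np
import matplotlib.pyplot as plt

def plot_top_mi(mi, top=20):
    """Bar chart of the top features ranked by mutual information."""
    mi = np.asarray(mi)
    order = np.argsort(mi)[::-1][:top]             # highest MI first
    plt.bar(range(top), mi[order])
    plt.xticks(range(top), [f"F{i + 1}" for i in order], rotation=90)
    plt.ylabel("MI with relevance label")
    plt.title("Top-20 Mutual Information")
    plt.tight_layout()
    plt.show()
        </preformat>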
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Performance Comparison</title>
        <p>Quantum Annealing (QA) consistently outperformed Simulated Annealing (SA) across all metrics. The
best QA submission (NDCG@10 = 0.4409) surpassed the best SA score (NDCG@10 = 0.4212) while also
requiring substantially lower annealing time (286,966 µs vs. 4,072,677 µs). All QA-based submissions
clustered around high performance, showcasing the robustness of the quantum annealing approach.
The results highlight QA's ability to explore the feature space more efficiently and effectively than
classical SA methods.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Visualization and Analysis</title>
      <p>This bar chart (Figure 1) illustrates the varying relevance of individual features, showing that some
features (e.g., Features 23, 39, and 21) exhibit significantly higher mutual information with the target label,
indicating their strong individual predictive power. This helps in understanding the baseline information
content of the most relevant features.</p>
      <p>Figure 2 visualizes the overlap and unique feature selections between Quantum Annealing (QA) and
Simulated Annealing (SA). The "Common Features" bar indicates the set of features identified by both
methods, while "QA Only" and "SA Only" show features uniquely selected by each. This comparison
highlights that while there’s a significant common subset, both annealing approaches also identify
distinct features, suggesting different exploration patterns or optima found within the feature space.</p>
      <p>The QA and SA methods selected partially overlapping sets of features, but QA demonstrated
better diversity and lower intra-set redundancy, as evident in the correlation matrix (Figure 3). This
heatmap displays the Pearson correlation coefficients between the selected features. Lighter areas
(approaching white) within the heatmap indicate correlation values close to zero, meaning there is
minimal linear relationship and thus low redundancy between those specific feature pairs. This is a
desirable characteristic for a feature subset, as it suggests the selected features contribute unique and
non-overlapping information, leading to improved model robustness.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our findings demonstrate the promise of quantum annealing for effective and efficient feature selection
in ranking tasks. The observed speed-ups in annealing time and the enhanced ranking performance indicate that
quantum approaches may augment or even supersede conventional metaheuristics as quantum hardware
matures.</p>
      <p>However, we also note some limitations, including variation in QA outcomes between runs, limited
control over the embedding on the QPU, and sensitivity to the accuracy of the MI estimation. All of these
aspects deserve further study.</p>
      <sec id="sec-6-1">
        <title>6.1. Conclusion and Future Work</title>
        <p>We demonstrated a hybrid quantum-classical pipeline for feature selection in learning-to-rank tasks
using the MQ2007 dataset. Our Quantum Annealing-based approach showed promising results both
in performance and computation time. Future work will explore scaling the method to larger feature
spaces, integrating alternative quantum algorithms (e.g., QAOA), and improving mutual information
estimators tailored for sparse ranking labels.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We thank the Quantum CLEF 2025 organizers and the D-Wave team for providing access to quantum
annealing hardware.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 for grammar and spelling checking. After
using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elisseef</surname>
          </string-name>
          ,
          <article-title>An introduction to variable and feature selection</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>1157</fpage>
          -
          <lpage>1182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Feature selection based on mutual information: criteria of maxdependency, max-relevance, and min-redundancy</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>27</volume>
          (
          <year>2005</year>
          )
          <fpage>1226</fpage>
          -
          <lpage>1238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fleuret</surname>
          </string-name>
          ,
          <article-title>Fast binary feature selection with conditional mutual information</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>1531</fpage>
          -
          <lpage>1555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <article-title>An exploratory study on simulated annealing for feature selection in learning-to-rank</article-title>
          , arXiv preprint arXiv:2310.13269, https://arxiv.org/abs/2310.13269,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <article-title>Ising formulations of many NP problems</article-title>
          ,
          <source>Frontiers in Physics</source>
          <volume>2</volume>
          (
          <year>2014</year>
          )
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mücke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Heese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Piatkowski</surname>
          </string-name>
          ,
          <article-title>Feature selection on quantum computers</article-title>
          ,
          <source>Quantum Machine Intelligence</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yarkoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raponi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bäck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <article-title>Quantum annealing for industry applications: introduction and review</article-title>
          ,
          <source>Reports on Progress in Physics</source>
          <volume>85</volume>
          (
          <year>2022</year>
          )
          <fpage>104001</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] A. Pasin, M. F. Dacrema, W. Cuhna, M. A. Gonçalves, P. Cremonesi, N. Ferro, QuantumCLEF 2025: Overview of the second quantum computing challenge for information retrieval and recommender systems at CLEF, in: Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] A. Pasin, M. F. Dacrema, W. Cuhna, M. A. Gonçalves, P. Cremonesi, N. Ferro, Overview of QuantumCLEF 2025: The second quantum computing challenge for information retrieval and recommender systems at CLEF, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, 2025.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>