Team shm2024 on Quantum Feature Selection Notebook for the QuantumCLEF Lab at CLEF 2024

Shimi Gersome1, Jerin Mahibha2 and Durairaj Thenmozhi3
1 Madras Christian College, Chennai, India
2 Meenakshi Sundararajan Engineering College, Chennai, India
3 Sri Sivasubramaniya Nadar College of Engineering, Chennai, India

CLEF 2024: Conference and Labs of the Evaluation Forum, September 9-12, 2024, Grenoble, France
gshimi2022@gmail.com (S. Gersome); jerinmahibha@gmail.com (J. Mahibha); theni_d@ssn.edu.in (D. Thenmozhi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Quantum phenomena have long drawn considerable attention in scientific studies because they defy common perception, and quantum computing, a recently evolving field, has started to attract comparable interest. Quantum computers can solve some distinct classes of NP-hard (nondeterministic polynomial-time hard) problems faster than traditional computers. This work implements Task 1 (feature selection) of the QCLEF 2024 shared task on the MQ2007 dataset, which contains 46 candidate features. The task is performed using both Simulated Annealing and Quantum Annealing, and the two methods are compared on ndcg@10 (Normalized Discounted Cumulative Gain at rank 10) and annealing time. With Quantum Annealing we obtained an ndcg@10 of 0.3621 and an annealing time of 27222 milliseconds; with Simulated Annealing we obtained an ndcg@10 of 0.4024 and an annealing time of 284106 milliseconds.

Keywords
Quantum Computing, NP-Hard, Annealing, Simulated Annealing, Feature Selection

1. Introduction

Quantum computing is a multidisciplinary research topic of the current era and an emerging field of computer science. It rests on the principles of quantum mechanics and combines physics, mathematics, and computer science. Quantum computing operates on quantum bits (qubits), the basic unit of quantum information. Because computation is carried out on qubits rather than on classical bits (binary digits), quantum computers can outperform typical computers on certain problems [1]. A single qubit can hold the state 0, the state 1, or, through the principle of superposition, any quantum superposition of the two states, giving it the potential to process multiple states concurrently. Quantum algorithms build on superposition and entanglement, which can enhance computational speed exponentially in some distinct areas.

Machine Learning (ML) algorithms are intended for prediction, feature selection, classification, and clustering using mathematical formulations, and their performance depends on the data and its size. Quantum Machine Learning (QML) combines ML with quantum computing concepts to enable faster computation; it aims to optimize current ML techniques by leveraging the performance benefits of quantum algorithms [2]. Task 1 of the QCLEF 2024 shared task asked participants to perform feature selection with quantum computing on the MQ2007 dataset, which provides 46 candidate features plus a relevance label. Feature selection is an essential process in ML: it chooses a subset of relevant features from those available in order to improve model performance. It reduces dimensionality, complexity, overfitting, and computational cost, which improves model generalization. Similarly, QML feature selection mitigates the curse of dimensionality, computational complexity, and the effect of overfitting, and the selection problem can be solved through heuristics [3].
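As a concrete illustration of the superposition principle described above, the following minimal numpy sketch (with arbitrarily chosen amplitudes; it is not part of the task implementation) represents a single-qubit state and the measurement probabilities it induces.

# Illustrative only: a single-qubit state |psi> = alpha|0> + beta|1>
# represented as a normalized complex amplitude vector.
import numpy as np

alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)  # arbitrary amplitudes
state = np.array([alpha, beta])

assert np.isclose(np.linalg.norm(state), 1.0)  # valid states are unit vectors

# Born rule: measurement yields 0 or 1 with probability |amplitude|^2.
probs = np.abs(state) ** 2
print("P(0) =", probs[0], "P(1) =", probs[1])  # 0.5 and 0.5 here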
2. Related Work

Wang [3] proposed Quantum Support Vector Machine Feature Selection (QSVMF), which jointly optimizes classification accuracy, the number of selected features, quantum circuit cost, and feature covariance. The QSVMF approach was applied to a breast cancer dataset, where it identified relevant biomarkers. Poggiali et al. [4] introduced a novel Hybrid Quantum Feature Selection (HQFS) approach based on variance estimation, evaluated on two synthetic datasets and one real dataset. The ranking it produced was similar to that of classical algorithms, and HQFS eliminated the low-variance features reliably when the number of additional qubits was chosen correctly. Hellstern et al. [5] formulated feature selection as a quadratic unconstrained binary optimization (QUBO) problem. On smaller datasets the QUBO formulation outperformed the classical numerical method, while on larger datasets the classical method performed better. Bhagawati and Subramanian [6] used Quantum Annealing (QA) to solve a QUBO, applying optimization theory to minimize or maximize a quadratic objective function. They hypothesized that, with this hybrid model, feature selection would work comparatively better during the ranking process on the LETOR dataset, and found that, compared with the standard algorithms (LTR and LambdaMART), the hybrid yielded normalized discounted cumulative gains of 0.39 and 0.80, respectively.

3. Methodology

This experimental research was carried out in the workspace provided by QCLEF 2024, using the MQ2007 dataset for Task 1A [7]. LETOR (LEarning TO Rank for Information Retrieval) is a benchmark dataset collection for information retrieval research containing standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. MQ2007 is one of the query sets from LETOR, released in 2007 (https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The MQ2007 dataset contains 41955 instances and 47 columns, one of which is the target; the target is labeled by human experts as "relevant", "partially relevant", or "not relevant". The dataset is stored in text format, and all features are of datatype float64 except the target, which is of datatype int64 with values 0, 1, or 2. For shared Task 1A of QCLEF 2024, feature selection was carried out with both Simulated Annealing (SA) and Quantum Annealing (QA); the performance of the two was compared on different parameters, and the 5 most relevant features were selected from those available.

3.1. Model Description

A Binary Quadratic Model (BQM) is a collection of binary-valued variables (each taking one of two values) together with associated linear and quadratic biases. Quadratic Unconstrained Binary Optimization (QUBO) models are consumed by samplers such as the D-Wave system, which uses Quantum Annealing to find low-energy solutions to some distinct types of NP-hard problems. The problem is expressed as the minimization

\min_{x \in \{0,1\}^N} \sum_{i=1}^{N} q_i x_i + \sum_{i<j} q_{i,j} x_i x_j

where q_i and q_{i,j} denote the linear and quadratic coefficients, respectively; the value of this objective is the energy that the QUBO minimizes.
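To make the formulation concrete, the following is a minimal sketch of how such a QUBO can be assembled and sampled with the D-Wave Ocean stack (dimod for the model, neal for classical simulated annealing). The coefficient values are illustrative placeholders, not the ones used in the task, which are derived from the information-theoretic quantities described next.

# Sketch: assembling and sampling a small QUBO with the D-Wave Ocean stack.
# The coefficients below are illustrative placeholders only.
import dimod
import neal

# Linear biases q_i (one per candidate feature): negative values reward
# selecting the feature. Quadratic couplings q_ij: positive values
# penalize selecting redundant pairs together.
linear = {"f1": -0.9, "f2": -0.7, "f3": -0.4}
quadratic = {("f1", "f2"): 0.5, ("f1", "f3"): 0.1, ("f2", "f3"): 0.3}

bqm = dimod.BinaryQuadraticModel(linear, quadratic, 0.0, dimod.BINARY)

# Classical baseline: simulated annealing over the same model.
sampleset = neal.SimulatedAnnealingSampler().sample(bqm, num_reads=100)
best = sampleset.first
selected = [v for v, bit in best.sample.items() if bit == 1]
print("selected:", selected, "energy:", best.energy)

# On D-Wave hardware the same BQM would be sampled with, e.g.:
#   from dwave.system import DWaveSampler, EmbeddingComposite
#   sampleset = EmbeddingComposite(DWaveSampler()).sample(bqm, num_reads=100)
# A penalty term P * (sum_i x_i - k)^2 can be added to favor samples that
# select exactly k features (k = 5 in this task).

In this formulation, low-energy samples correspond to feature subsets that are individually informative yet mutually non-redundant, which is precisely what the annealer searches for.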
To extract the required information from the dataset, mutual information and conditional mutual information are used. The information content of the dataset is quantified using the Shannon entropy

H(X) = -\sum_{x \in X} p(x) \log p(x)

where p(x) denotes the probability of occurrence of an event. The Conditional Shannon Entropy (CSE) quantifies the remaining uncertainty in X when Y is known:

H(X|Y) = H(X,Y) - H(Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y) - H(Y)

where p(x,y) denotes the joint probability of x and y, and H(X,Y) denotes the information in X and Y together. Since the features interact with each other through qubits, each feature is quantified using the Shannon Entropy (SE) and normalized using the CSE. The correlation, or dependency, between the target and each of the other features is calculated using the mutual information

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}

where p(x) and p(y) denote the marginal probabilities and p(x,y) the joint probability. Finally, the interaction between the target feature and pairs of other features is calculated and stored as an additional feature, using the conditional mutual information

I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

where H(X|Z) denotes the CSE of X given Z, and H(X|Y,Z) denotes the CSE of X given Y and Z.
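A compact sketch of how these quantities can be estimated from discretized feature columns, assuming plain numpy over integer-binned arrays; the helper names are ours for illustration, not taken from the task implementation.

# Sketch: estimating the entropy and mutual-information quantities above
# from discretized (integer-binned) feature columns using plain numpy.
import numpy as np

def shannon_entropy(*columns):
    """H of the joint distribution of one or more discrete columns (bits)."""
    _, counts = np.unique(np.column_stack(columns), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """H(X|Y) = H(X,Y) - H(Y)."""
    return shannon_entropy(x, y) - shannon_entropy(y)

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to the double sum above."""
    return shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X|Z) - H(X|Y,Z)."""
    h_x_given_z = shannon_entropy(x, z) - shannon_entropy(z)
    h_x_given_yz = shannon_entropy(x, y, z) - shannon_entropy(y, z)
    return h_x_given_z - h_x_given_yz

# Continuous feature columns would first be binned, e.g.:
#   binned = np.digitize(col, np.histogram_bin_edges(col, bins=10))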
4. Experimental Setup

4.1. Dataset

MQ2007, a LETOR (LEarning TO Rank) document retrieval dataset, consists of user queries and the corresponding documents retrieved for each query [8]. The retrieved documents are labeled by human experts as "relevant", "partially relevant", or "not relevant" [9]. The dataset for shared task Task 1A, Feature Selection, of QCLEF 2024 is read from a text file [10].

5. Results

Task 1A was accomplished in the workspace provided by QCLEF 2024 using the MQ2007 dataset. The dataset has 47 columns, one of which is used as the target. From the 46 candidate features we selected the 5 most relevant to the downstream ranking problem. The performance metrics of Quantum Annealing are shown in Table 1 and those of Simulated Annealing in Table 2. The results show that Simulated Annealing performs better in terms of ndcg@10, whereas Quantum Annealing requires considerably less annealing time. The mutual information between each feature and the target, ordered by rank, is visualized in Figure 1.

Figure 1: Mutual Information between Target and Feature

Table 1
Performance metrics for Quantum Annealing
Run     ndcg@10   Annealing Time (ms)   No. of Features
Run1    0.365     29600                 5
Run2    0.3621    27222                 5
Run3    0.391     29414                 5
Run4    0.3477    28254                 5
Run5    0.3245    29315                 5

Table 2
Performance metrics for Simulated Annealing
Run     ndcg@10   Annealing Time (ms)   No. of Features
Run1    0.4024    224106                5
Run2    0.3032    163964                5
Run3    0.4249    143983                5
Run4    0.4248    146676                5
Run5    0.4205    144050                5

6. Conclusions and Future Work

Shared task Task 1A of QCLEF 2024 required feature selection on the MQ2007_train dataset. The 5 most prominent features were selected from the 46 available. The implementation was carried out with both Simulated Annealing and Quantum Annealing, and we observed that the annealing time of the quantum model is lower than that of Simulated Annealing, while Simulated Annealing achieves a higher ndcg@10.

References

[1] V. R, H. V, Quantum computing in machine learning - the future of quantum computing, International Journal of Advanced Research in Science, Communication and Technology (2024) 311-314. doi:10.48175/IJARSCT-15955.
[2] M. Nivelkar, S. G. Bhirud, Optimized machine learning: Training and classification performance using quantum computing, in: 2021 IEEE 6th International Conference on Computing, Communication and Automation (ICCCA), 2021, pp. 8-13. doi:10.1109/ICCCA52192.2021.9666429.
[3] H. Wang, A novel feature selection method based on quantum support vector machine, 2023. arXiv:2311.17646.
[4] A. Poggiali, A. Bernasconi, A. Berti, G. Del Corso, R. Guidotti, Quantum feature selection with variance estimation, 2023, pp. 245-250. doi:10.14428/esann/2023.ES2023-99.
[5] G. Hellstern, V. Dehn, M. Zaefferer, Quantum computer based feature selection in machine learning, IET Quantum Communication. URL: https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/qtc2.12086. doi:10.1049/qtc2.12086.
[6] R. Bhagawati, T. Subramanian, An approach of a quantum-inspired document ranking algorithm by using feature selection methodology, International Journal of Information Technology 15 (2023) 4041-4053.
[7] A. Pasin, M. Ferrari Dacrema, P. Cremonesi, N. Ferro, QuantumCLEF 2024: Overview of the Quantum Computing Challenge for Information Retrieval and Recommender Systems at CLEF, in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, September 9th to 12th, 2024, 2024.
[8] T. Qin, T.-Y. Liu, Introducing LETOR 4.0 datasets, 2013. arXiv:1306.2597.
[9] T. Qin, T.-Y. Liu, J. Xu, H. Li, LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval 13 (2010) 346-374.
[10] A. Pasin, M. Ferrari Dacrema, P. Cremonesi, N. Ferro, Overview of QuantumCLEF 2024: The Quantum Computing Challenge for Information Retrieval and Recommender Systems at CLEF, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 15th International Conference of the CLEF Association, CLEF 2024, Grenoble, France, September 9-12, 2024, Proceedings, 2024.