<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OptiClust4Rec: Unsupervised Data-Driven Methodology for Quality of Life Recommendations During a Medical Therapy (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juba Agoun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanis Bouallouche</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohand-Saïd Hacid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Claude Bernard Lyon 1</institution>
          ,
          <addr-line>CNRS, LIRIS, UMR5205, 69100 Villeurbanne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Montpellier</institution>
          ,
          <addr-line>34095 Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Upon the introduction of novel medical therapies, an array of semantically different data is gathered from the participant cohort. Unsupervised learning is often favored as a preliminary step for data investigation, to extract valuable information before embarking on the tedious task of data labeling. Clustering is one of the techniques that provides a comprehensive overview for exploratory data analysis, aiding in the identification of patient communities. With OptiClust4Rec, we provide a characterization of clusters, from which we can derive recommendations for patients undergoing therapy. Our focus is on optimizing the clustering and the dimensionality reduction based on concise metrics and data topology analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Clustering</kwd>
        <kwd>Optimization</kwd>
        <kwd>Data analysis</kwd>
        <kwd>Unsupervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The healthcare sector generates a substantial amount of data daily. New treatment therapies are
tested on patients during clinical trials. The acquired data are semantically different, covering
health status, side effects, work potential, and lifestyle. They are collected to unveil causal inferences
regarding treatment effectiveness. These data have therefore become instrumental in predicting
future health conditions, reducing the cost of treatments, and improving quality of life in
general. For instance, to improve patients’ immunotherapy treatment, the analysis of upstream
medical data enables the production of recommendations and guidelines for practitioners.</p>
      <p>In aiding medical decision-making, classical patient clustering proves to be a dependable
approach. It involves identifying dominant characteristics within patient groups and tracking
their health progression. Unsupervised techniques allow for an initial analysis of data
relationships without necessitating specialized domain knowledge. While supervised methods have
shown their efficacy, building labeled datasets remains a time-consuming and labor-intensive
task. Hence, with OptiClust4Rec, we introduce an unsupervised data-driven methodology
enabling the discovery of concealed patterns in clinical patient data alongside questionnaire
responses.</p>
      <p>OptiClust4Rec1 (Optimized Clustering for Recommendation) introduces a user-friendly
web-based interface aimed at assessing clustering algorithm capabilities for specific datasets. This
evaluation employs statistical measures based on internal metrics, complemented by visual
exploration to analyze how the measures vary with diverse parameters and dimensionality reduction
techniques. The tool guides users through optimizing the clustering process, producing distinct
clusters, and then characterizing them by extracting essential features. To generate patient
recommendations, we utilize two datasets: one with patient analyses and information, and
another containing their questionnaire responses. OptiClust4Rec is designed to cluster patients
based on both sets of data and to label the clusters, which, consequently, provides end-users with
association rules and the subsequent production of recommendations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System overview</title>
      <p>In our approach, we address two challenges. First, we propose a set of metrics and visualization
tools that enable users to optimize data clustering. Second, leveraging the clustering results,
we focus on each cluster to extract the variables that characterize it, essentially creating a
form of automatic labeling based on what are known as salient features.</p>
      <sec id="sec-2-1">
        <title>2.1. Clustering</title>
        <p>
          There are many well-understood techniques to draw upon: centroid-based, connectivity-based,
grid-based, and density-based. However, choosing the right method with the right parameters for a
given dataset does not follow any deterministic procedure. In practice, choosing between
clustering methods, or determining the number of clusters to supply as an input value, requires testing
different algorithms. Indeed, researchers either select a default method (e.g., k-means) with a
number of clusters fixed by domain knowledge, or subjectively choose the
most recent method available [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Next, we introduce the two optimizations we target with
our tool.
        </p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Find a well-adapted clustering algorithm</title>
          <p>
            Among the existing clustering algorithms, the challenge is to identify the appropriate model
for the data being analyzed. Our idea relies on the adoption of Persistent Homology, a
fundamental technique within the domain of Topological Data Analysis (TDA) [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. This approach
systematically examines the relationships among data points across various scales, providing
invaluable insights into the presence, geometric arrangement, and density of potential clusters.
The application of Persistent Homology results in a persistence diagram.
          </p>
          <p>In the persistence diagram, each distinct color represents a unique homology group. Specifically,
H0 represents the connected components, acting as a guide in estimating a potential cluster
count. H1 points denote one-dimensional voids, while the H2 group outlines two-dimensional
voids, often referred to as cavities. The diagram traces the homological changes as we
explore various scales of distance or proximity within the dataset. Each connected component
and void materializes as a point on the graph, with its inception (appearance) and cessation
(disappearance) plotted along the horizontal and vertical axes. A significant deviation of a
point from the diagonal indicates the enduring presence of the corresponding component,
potentially signifying robust and lasting structures within the data. Upon thorough examination
of the persistence diagram for a given dataset, a notable deviation of certain H1 and H2 points
from the diagonal strongly suggests the presence of nonlinear structures. These may involve
spherical forms or overlapping clusters. Consequently, this observation supports
the recommendation of a density-based algorithm like DBSCAN, renowned for its
capability to perform exceptionally well in complex scenarios of this nature.</p>
          <p>1. Visual summary of our approach through this illustration: https://tinyurl.com/OptiClust4Rec-Figures</p>
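          <p>The H0 reading of a persistence diagram can be sketched without any TDA library: single-linkage merging over edges sorted by length yields the birth-death pairs of connected components. The following is an illustrative, pure-NumPy sketch of the H0 case only (H1/H2 voids require a full persistent-homology package), on made-up toy data; it is not the implementation used in OptiClust4Rec.</p>

```python
import numpy as np

def h0_persistence(points):
    """H0 persistence pairs via single-linkage merging (union-find over
    edges sorted by length). Each point is a connected component born at
    scale 0; when two components merge at distance d, one dies at d; the
    final component persists to infinity."""
    n = len(points)
    dmat = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = sorted((dmat[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    pairs = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, dist))      # a component dies at this scale
    pairs.append((0.0, np.inf))            # the surviving component
    return pairs

# Two well-separated blobs: among the finite death times, one is far larger
# than the rest, signalling two persistent connected components (H0), i.e.
# a likely two-cluster structure.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
diagram = h0_persistence(X)
```

          <p>Points far from the diagonal of the diagram correspond to the large death times produced here, which is exactly the visual cue used to estimate the cluster count.</p>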
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Find the optimal number k of clusters</title>
          <p>Non-density-based clustering techniques require as an input the number of clusters to be formed.
Thus, determining the exact or optimal number is a challenging task due to the absence of a
universal method for identifying the ideal number for a given dataset. The elbow method and
the silhouette score are commonly used methods to find the optimal number of clusters.</p>
          <p>The elbow method assesses cluster compactness by computing the Within-Cluster Sum of
Squares (WCSS) for different cluster numbers, observing a decrease in WCSS as k increases,
indicating improved compactness. However, there comes a point where adding more clusters no
longer enhances quality. The point of diminishing return, visually depicted as an elbow in
the chart of WCSS with respect to the number of clusters, indicates the optimal cluster count.
Nevertheless, it is important to note that this method does not account for cluster separation, an
important aspect of ideal clustering.</p>
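          <p>As an illustration of the elbow heuristic described above, the sketch below computes WCSS over candidate cluster counts with a minimal k-means on made-up blob data. The farthest-point seeding is our simplification for determinism, not something prescribed by the method itself.</p>

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=50):
    """Within-Cluster Sum of Squares after a basic Lloyd's k-means run,
    initialised with greedy farthest-point seeding for determinism."""
    centers = [X[0]]
    for _ in range(k - 1):
        gap = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(gap)])      # next seed: farthest point
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return float(((X - centers[labels]) ** 2).sum())

# Three tight, well-separated blobs: WCSS drops sharply up to k = 3 and
# flattens afterwards -- the "elbow" sits at the true cluster count.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, (30, 2)) for m in (0.0, 5.0, 10.0)])
wcss = {k: kmeans_wcss(X, k) for k in range(1, 7)}
```

          <p>Plotting wcss against k on such data shows the characteristic bend at k = 3, after which further clusters buy almost no compactness.</p>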
          <p>In contrast, the silhouette score provides a more comprehensive evaluation of cluster quality
by considering both cohesion (average distance within a cluster) and separation (average
distance to the nearest neighboring cluster) for each data point. However, it may encounter
challenges when identifying complex clusters with diverse shapes. It’s important to note that
in some instances, clusters identified by the silhouette score may exhibit uneven distributions,
potentially resulting in clusters with only a few observations while the rest are dispersed across
other clusters.</p>
          <p>
            We proceed to determine the optimal value of k by using two internal metrics: connectivity
and variability [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. The goal is to examine how variability changes with respect to connectivity.
With this pattern analysis, we aim to identify the characteristic point where variability decreases
significantly while connectivity slightly increases. This point is visualized as a knee on the graph
depicting the evolution of variability with respect to connectivity, and it signifies the optimal
number of clusters, following [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Our approach is applied to seven different algorithms and is
rounded off with a voting mechanism. For example, if four out of the seven algorithms advocate
for four clusters as the ideal number, then four clusters will be considered optimal.
          </p>
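          <p>A minimal sketch of the knee detection and the voting mechanism, assuming a Kneedle-style criterion (the point farthest from the chord joining the curve's endpoints, in the spirit of [4]); the variability curve and the per-algorithm suggestions below are made-up illustrative values.</p>

```python
import numpy as np
from collections import Counter

def knee_index(x, y):
    """Index of the knee: the point with maximum perpendicular distance to
    the chord joining the curve's first and last points (Kneedle-style)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    p0 = np.array([x[0], y[0]])
    seg = np.array([x[-1], y[-1]]) - p0
    rel = np.stack([x, y], axis=1) - p0
    # perpendicular distance of each point to the chord (cross-product form)
    dist = np.abs(rel[:, 0] * seg[1] - rel[:, 1] * seg[0]) / np.linalg.norm(seg)
    return int(np.argmax(dist))

# Made-up variability curve over candidate cluster counts: a sharp drop
# followed by a plateau places the knee at k = 4.
ks = [2, 3, 4, 5, 6, 7, 8]
variability = [100, 60, 20, 15, 12, 10, 9]
k_knee = ks[knee_index(ks, variability)]

# Majority vote across the seven algorithms' suggestions (illustrative values).
votes = [4, 4, 3, 4, 5, 4, 3]
optimal_k = Counter(votes).most_common(1)[0][0]
```
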
          <p>
            Dealing with high-dimensional data presents challenges, emphasizing the need for
dimensionality reduction. The curse of dimensionality poses a significant problem, leading us to favor
UMAP over Principal Component Analysis (PCA) for its efficiency in preserving data structure
and topology. We explore various dimensionality reductions, considering a maximum of √n
dimensions for n observations [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. For each reduction, we search for the optimal number of
clusters and examine the variance in cluster numbers across dimensions, using lower variance as
an indicator of particular interest and letting it influence our selection of the cluster count.
          </p>
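          <p>The dimension budget and the variance-based selection can be sketched as follows; the per-dimension optimal cluster counts are made-up placeholders standing in for the per-reduction optimization described above, and the majority rule over them is our illustrative assumption.</p>

```python
import numpy as np

# For n observations, candidate target dimensions range up to floor(sqrt(n)).
n_obs = 400
max_dim = int(np.sqrt(n_obs))            # 20 candidate dimensions here

# Illustrative (made-up) optimal cluster counts found per reduced dimension:
optimal_k_per_dim = {2: 4, 3: 4, 4: 4, 5: 3, 6: 4}
ks = np.array(list(optimal_k_per_dim.values()))
spread = ks.var()                        # low variance signals a stable k
chosen_k = int(np.bincount(ks).argmax()) # most frequent count across dims
```
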
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Characterizing clusters</title>
        <p>
          Given our primary objective of working with semantically diverse data to derive association
rules, it is important that our data be appropriately labeled. Following the application of
clustering, our aim is to distill the prominent characteristics of each cluster by extracting the
salient features. To accomplish this, we adopt the approach outlined in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In this step, the
observations are categorized into in-pattern and out-pattern records. By analyzing the
in-patterns and out-patterns within a given cluster, we can pinpoint the salient features and
discern whether the associated variables exhibit high or low values.
        </p>
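        <p>One plausible reading of this step, sketched as a standardized mean difference between in-pattern and out-pattern records; the z-score criterion and threshold are our illustrative assumptions, not necessarily the exact procedure of [6].</p>

```python
import numpy as np

def salient_features(X, labels, cluster, names, z_thresh=1.0):
    """Flag features whose in-pattern mean deviates from the out-pattern
    mean by more than z_thresh pooled standard deviations, labelled 'high'
    or 'low' according to the sign of the deviation."""
    inside = X[labels == cluster]        # in-pattern records
    outside = X[labels != cluster]       # out-pattern records
    diff = inside.mean(axis=0) - outside.mean(axis=0)
    pooled_sd = np.sqrt((inside.var(axis=0) + outside.var(axis=0)) / 2) + 1e-12
    z = diff / pooled_sd
    return {names[j]: ('high' if z[j] > 0 else 'low')
            for j in np.flatnonzero(np.abs(z) > z_thresh)}

# Toy cohort: cluster 0 shows elevated 'fatigue' and lowered 'mobility';
# 'age' is uninformative and should not be flagged.
rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (100, 3))
labels = np.repeat([0, 1], 50)
X[labels == 0, 0] += 3.0
X[labels == 0, 1] -= 3.0
profile = salient_features(X, labels, 0, ['fatigue', 'mobility', 'age'])
```

        <p>The resulting dictionary is precisely the kind of automatic cluster label from which high/low variable statements, and later association rules, can be derived.</p>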
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. OptiClust4Rec</title>
      <p>OptiClust4Rec2 is a web-based application designed for the analysis and visualization of
both medical and nonmedical datasets. It primarily employs unsupervised techniques, mainly
clustering, and offers guidance through the utilization of dimensionality reduction,
topological data analysis, and automated labeling methods. As the screenshots in figure 1 illustrate,
OptiClust4Rec has three main user interfaces, which we detail in the following:
A. This area presents the results of different clustering operations in different dimensions. After
loading the datasets in another interface, the user obtains results compared against the most
well-known internal metrics in the literature.</p>
      <p>B. This area displays the result of persistent homology, which provides information on
the presence of cavities. The more H1 and H2 points we find deviating from the diagonal,
the more we recommend the use of density-based methods.</p>
      <p>C. This area displays the results of cluster characterization. Once the user selects
the appropriate clustering method, each cluster is labeled with its salient variables.</p>
      <p>Given the result of each dataset’s cluster characterization, the user can derive correlations
between the clusters of the different semantic datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This research is supported by the European Union’s Horizon 2020 research and innovation
program under grant agreement No 875171, project QUALITOP (Monitoring multidimensional
aspects of Quality of Life after cancer ImmunoTherapy - an Open smart digital Platform for
personalized prevention and patient management).</p>
      <p>2. Link to tool and video: https://tinyurl.com/Demo-OptiClust4Rec</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <article-title>Selecting appropriate clustering methods for materials science applications of machine learning</article-title>
          ,
          <source>Advanced Theory and Simulations</source>
          <volume>2</volume>
          (
          <year>2019</year>
          )
          <fpage>1900145</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          ,
          <article-title>Topological data analysis</article-title>
          ,
          <source>Annual Review of Statistics and Its Application</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>501</fpage>
          -
          <lpage>532</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Handl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knowles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kell</surname>
          </string-name>
          ,
          <article-title>Bioinformatics computational cluster validation in postgenomic data analysis</article-title>
          ,
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>21</volume>
          (
          <year>2005</year>
          )
          <fpage>3201</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Satopaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Irwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <article-title>Finding a "kneedle" in a haystack: Detecting knee points in system behavior</article-title>
          ,
          <source>in: 2011 31st International Conference on Distributed Computing Systems Workshops</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lowey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Suh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dougherty</surname>
          </string-name>
          ,
          <article-title>Optimal number of features as a function of sample size for various classification rules</article-title>
          ,
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>21</volume>
          (
          <year>2005</year>
          )
          <fpage>1509</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Khoie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tabrizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khorasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Marhamati</surname>
          </string-name>
          ,
          <article-title>A hospital recommendation system based on patient satisfaction survey</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>7</volume>
          (
          <year>2017</year>
          )
          <fpage>966</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>