<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparative Analysis of Classification Techniques for Cervical Cancer Utilising At Risk Factors and Screening Test Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sean Quinlan</string-name>
          <email>sean.a.quinlan@mycit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haithem Afli</string-name>
          <email>haithem.afli@cit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruairi O'Reilly</string-name>
          <email>ruairi.oreilly@cit.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cork Institute of Technology</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cervical cancer is a severe concern for women's health. Every year in the Republic of Ireland, approximately 300 women are diagnosed with cervical cancer, and for 30% of them the diagnosis will prove fatal. It is the second most common cause of death due to cancer in women aged 25 to 39 years [14]. Recently there has been a series of controversies concerning the mishandling of results from cervical screening tests, delays in processing said tests and the recalling of individuals to retake tests [12]. The serious nature of the prognosis highlights the importance of, and need for, the timely processing and analysis of data related to screenings. This work presents a comparative analysis of several classification techniques used for the automated analysis of known risk factors and screening tests with the aim of predicting cervical cancer outcomes via a biopsy result. These techniques encompass tree-based, cluster-based, linear and ensemble methods, and where applicable use parameter tuning to determine optimal model parameters. The dataset utilised for training and validation consists of 858 observations and 36 variables, including the binary target variable "Biopsy". The data itself is heavily imbalanced, with 803 negative and 55 positive observations, and approximately 11.73% of the data points missing. These issues are addressed during pre-processing by methods such as mean or median imputation, as well as over-sampling, under-sampling and combination techniques, which led to the creation of 6 augmented datasets of varying size, consisting of 34 variables including the response Biopsy. The results show that a SMOTE-Tomek combination resampling method in conjunction with a tuned Random Forest model produced an accuracy score of 99.69%, with recall and precision values of 0.99 for both positive and negative responses.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Classification Techniques</kwd>
        <kwd>Cervical Cancer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Cervical cancer is a disease in which healthy cells on the surface of the cervix
grow out of control forming a mass of cells called a tumour, which can then spread
to other regions of the body. After breast cancer, it is the second most common
cancer among women worldwide [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and is also one of the most preventable
cancers, with 90% of cases identifiable and treatable in its early stages [28].
      </p>
      <p>According to the World Health Organisation, comprehensive cervical cancer
control includes primary prevention (vaccination against HPV), secondary
prevention (screening and treatment of pre-cancerous lesions), tertiary prevention
(diagnosis and treatment of invasive cervical cancer) and palliative care [30]. It
is at the secondary screening phase that this analysis is to be employed.</p>
      <p>Diagnosing cervical cancer requires several physical tests, such as an HPV test, smear test, or colposcopy. This process can take a minimum of 4 weeks for results to return, and during the recent high-demand period results took up to 33 weeks to be returned [13].</p>
      <p>
        The use of classification techniques can provide an informed initial indication of at-risk individuals, enabling their tests to be expedited and medical intervention employed at an earlier stage. This is especially useful during periods of high-volume testing, such as those seen in Ireland in recent times [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], as delays in the diagnosis of cervical cancer are one of the main reasons for increased fatalities despite the availability of advanced medical facilities [17]. Similarly, this method has the potential to be of value in low-resource settings, as only an individual's risk factor information is needed to perform an initial screening.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>A woman's risk of developing cervical cancer is affected by several factors, some of which are intrinsic, such as genetics and age; others, such as smoking habits, methods of contraception, and diet, are modifiable. An implication of this is that individuals can take actions to reduce the impact of known risk factors. This work aims to analyse these known risk factors, the majority of which are modifiable, to determine the outcome of a patient's classification regarding cervical cancer based on biopsy results. The following studies have shown that these risk factors are significant in the development of cervical cancer.</p>
      <p>
        Manderson et al. [19] showed that bearing several children contributes to an increased risk of cervical cancer. In an Australian study, Xu et al. [32] found that hormonal contraceptives and smoking contribute to the development of cervical cancer, while a study by Shukla et al. [26] showed long-term use of contraceptive pills might lead to breast and cervical cancer. Averbach et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] highlighted the contribution of IUD contraceptives to the development of cervical cancer; a similar study by Rousset-Jablonski et al. [23] focused on IUDs in relation to pelvic inflammatory disease, which can further contribute to cervical cancer. Age, being an intrinsic feature, has been shown by Teame et al. [27] to contribute to the risk of a patient developing cervical cancer. Eldridge
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] concluded that smoking leads to cervical cancer by increasing the risk of
Human Papillomavirus (HPV) infection. Sexually transmitted diseases (STDs)
have also been shown to lead to an increased risk of HPV and cervical cancer by
Parthenis et al. [21], while a somewhat common-sense finding by Santelli et al.
[24] is that patients having multiple sexual partners have an increased risk of STDs,
which in turn leads to a greater risk of developing cervical cancer. Per the Irish
Cancer Society 2017 Review [15], HPV has been shown to be a large contributor
to the development of cervical cancer; the review also highlights a steep decline
(87% down to 50%) over the two-year period prior to the review in the numbers
receiving the vaccination due to social media misinformation, which stresses the
importance of clear, informed, and available information.
      </p>
      <p>
        Bosch et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] used linear logistic regression to study the relationship
between cervical cancer, HPV, aspects of sexual and reproductive behaviour, oral
contraceptives and smoking habits of patients, finding that HPV was the biggest
risk factor in determining occurrences of cervical cancer. The National Cancer
Registry Ireland (NCRI) also cites these factors as being leading contributors to
the development of cervical cancer [20]. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] also notes a significant increase in
risk for those in low-education areas. This increase is also noted by the WHO
[30] regarding higher rates of cervical cancer in developing countries.
      </p>
      <p>The advent of big data has seen increased interest in automated solutions
for analytical processes. In the context of healthcare, this has resulted in a
transition in clinical practice whereby practitioners are encouraged to incorporate
technology-based solutions if increased efficiencies, transparency or cost
reductions can be achieved by doing so. This transition is materialising in the
form of advanced artificial intelligence and machine learning-based techniques
in areas such as automated decision making, treatment plans and supervision of
patients.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>This research utilises classification techniques and patient data consisting of known risk factors, such as age, number of pregnancies, STDs, and smoking habits, with the intent of developing predictive models to accurately classify a patient's diagnosis of cervical cancer based on biopsy results. The analysis seeks to assess the dataset via several supervised classification models encompassing tree, cluster, linear and ensemble techniques, and where applicable apply parameter tuning to determine the optimal prediction parameters for each model. Each model is then compared to determine an overall optimal method for predicting the diagnosis of cervical cancer based on the Biopsy classification.</p>
      <p>
        The dataset used in this analysis is the "Cervical Cancer Risk Factors" dataset available from the UCI data repository [16]. This dataset originated from the Hospital Universitario de Caracas in Caracas, Venezuela and is derived from the historical medical records of 858 patients, with a Biopsy count of 803 negative to 55 positive observations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Similar work has previously been carried out on this dataset; the findings of two such papers are as follows. Alwesabi et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
have previously analysed this dataset with regard to classification and feature
selection, finding that a decision tree classifier yielded the best results predicting the
target "Biopsy" with an accuracy of 97.5%. W. Wu and H. Zhou [31] performed
feature selection with PCA and used three Support Vector Machine methods
to analyse the dataset: standard SVM, support vector machine recursive
feature elimination, and support vector machine-principal component analysis.
Their standard SVM model produced an accuracy of 94.13% in predicting the
response variable "Biopsy", with 100% sensitivity and 90.21% specificity.
      </p>
      <p>The approach taken in this paper can be differentiated from those mentioned previously in that they have either removed 3 of the 4 response variables ("Hinselmann", "Schiller" and "Cytology"), leaving only "Biopsy" as the target, or have carried out separate analyses with each of the 4 responses as a target while excluding the other 3.</p>
      <p>
        This analysis proposes to include "Hinselmann", "Schiller" and "Cytology" as features, leaving "Biopsy" as the single response. The rationale for this is that each of those variables is the result of a test carried out to determine the presence of abnormal cells [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] [25] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Therefore they can be used as features to
contribute to the outcome of a biopsy result and the presence of cervical cancer.
      </p>
      <p>
        <bold>Implementation.</bold>
The analysis was carried out using Python, with the loading/summarising of
data achieved via NumPy/Pandas, while visualisations were achieved via
graphical packages Seaborn and Matplotlib. The pre-processing, model building and
evaluation were carried out via the Scikit-learn package, which encompasses a
wide range of state-of-the-art machine learning algorithms [22]. To avoid the
"Reproducibility Crisis" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where applicable, a global integer variable was
created and assigned to the random state parameter for each method.
      </p>
      <p>This analysis followed the Cross-Industry Standard Process for Data Mining
(CRISP-DM) process [29], which provides a formal standardised framework of 6
cyclical steps for planning and implementing data mining.
1. Business Understanding - Achieved through the related work, introduction and evaluation sections.
2. Data Understanding - The related work showed that the dataset features were suitable for this analysis, and exploratory data analysis gave further insight into the data.
3. Data Preparation - Built on step 2 and achieved through pre-processing tasks such as missing value imputation, dealing with outliers, class imbalance and train/test splitting.
4. Modelling - Building the models and applying parameter tuning.
5. Evaluation - Comparing the models' results to determine the optimal model.
6. Deployment - Releasing the model to the production environment.</p>
      <p>Data preparation involved processing the data with regards to outlier
detection, handling missing values via mean/median imputation, and dealing with
imbalance using over, under and combination resampling techniques.</p>
      <p>The removal of outliers should be considered in the context of the effect
their removal would have on analysis. To manipulate the outliers, for instance,
replace them with mean/median values or remove observations, could negatively
impact the accuracy of the models either by the reduction in sample size or by
the narrowing of values the models could accurately account for. As such, it was
decided that potential outliers should be included.</p>
      <p>
        Missing data can occur for several reasons, be it difficulties encountered
during an experiment, errors during data collection or entry, or a systemic omission
of answers by respondents. The latter occurs here, with respondents choosing not
to answer certain questions due to privacy concerns [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Missing data rates of less
than 1% are generally considered trivial, and those between 1-5% are
manageable. However, 5-15% requires imputation techniques to handle, and more than
15% may severely impact any kind of interpretation or conclusions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The dataset has a total possible 30,888 (858 x 36) available data points. Of
these, 3,622 or 11.73% data points have missing values, while 27,266 data points
are populated. Figure 1 shows the extent of missing data. Note that only 26
variables are shown, as 10 variables had no missing data.</p>
      <p>Removing observations where missing data occurs will reduce the sample
size and, in turn, reduce the accuracy of any predictive models; it can also bias
the data, making any conclusions drawn not truly representative of the population.
As such, it is typically preferable to use imputation techniques to estimate
the missing values rather than remove observations. Imputation is the process
of estimating a missing value based on valid values of other variables and/or
subjects/observations in the sample.</p>
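      <p>As an illustration of the mean/median imputation step, the following sketch uses a toy data frame in place of the study data; the column names and values are illustrative only, not the dataset's actual variables:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the risk-factor data; the columns here are
# illustrative stand-ins, not the dataset's actual variable names.
df = pd.DataFrame({
    "Age": [18.0, 27.0, np.nan, 34.0, 41.0],
    "Pregnancies": [1.0, np.nan, 2.0, np.nan, 4.0],
})

# Mean imputation suits a roughly symmetric variable; the median is more
# robust for a skewed count variable.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Pregnancies"] = df["Pregnancies"].fillna(df["Pregnancies"].median())
assert not df.isna().any().any()  # every missing value has been estimated
```
      </p>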
      <p>A dataset is unbalanced when at least one class is represented by only a small
number of training examples while other classes make up the majority. This
imbalance gives rise to the class imbalance problem [18], which occurs when the
majority class's observations greatly outnumber those of the minority class in
a machine learning problem. Here, the response variable Biopsy has an imbalance
of 803 negative observations to 55 positive observations.</p>
      <p>Imbalanced-learn is a Python package that offers several resampling techniques that address this class imbalance problem. From this package 6 methods, 2 from each category of over-sampling, under-sampling and combination techniques, were used. This led to the creation of 6 augmented datasets of varying size, consisting of 34 features, including the response Biopsy. Table 1 shows the method used, the number of observations and the count of the target variable Biopsy in the newly augmented datasets.</p>
      <p>For each augmenting method used, a new dataset was created; each of these, along with the original pre-processed dataset, was shuffled and split into train and test sets (80/20 split) via the Scikit-learn model selection module. Following this, 7 lists were created to hold the respective split data from each dataset; this enabled the values to be accessed globally from the function. It should be noted that some augmenting methods produce float values where bool/int values are required; these were converted/rounded to the desired format.</p>
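      <p>The shuffled 80/20 split described above can be sketched as follows, with small synthetic arrays standing in for one augmented dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 synthetic observations stand in for one augmented dataset.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 45 + [1] * 5)

# Shuffled 80/20 split with a fixed random_state for reproducibility,
# mirroring the global random-state convention described earlier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3, shuffle=True)
```
      </p>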
      <p>Following the previously outlined pre-processing steps, the building of the
models from the training sets was carried out, and the test sets were then
evaluated. This process is associated with steps 4 and 5 of CRISP-DM. Scikit-learn
provides several modules and methods to accomplish this. Where applicable
the random state for each model was set to 3 for reproducibility, and
hyperparameter optimisation techniques to find the optimal values for each model
were employed.</p>
      <p>Models 1 &amp; 2: Decision Trees are a non-parametric supervised learning technique. For a classification tree, predictions for each observation are made by the most commonly occurring class of training observations in the region to which it belongs. This is achieved through recursive binary splitting - a greedy (better split now rather than later) top-down method that splits the nodes (variables) into two branches, moving down at each split towards a leaf decision node which represents the response. Here, the DecisionTreeClassifier method from the tree module was used. It employs an optimised version of the CART algorithm. With this, two models were created: Model 1, which has its criterion set to "entropy", and Model 2, where it is set to "gini".</p>
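      <p>A minimal sketch of the two tree models follows; make_classification generates a synthetic stand-in for the study data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=3)

# Model 1: split quality measured by information gain (entropy).
m1 = DecisionTreeClassifier(criterion="entropy", random_state=3).fit(X, y)
# Model 2: split quality measured by Gini impurity.
m2 = DecisionTreeClassifier(criterion="gini", random_state=3).fit(X, y)

# An unpruned tree grows until its leaves are pure, so it fits its own
# training data exactly.
acc1, acc2 = m1.score(X, y), m2.score(X, y)
```
      </p>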
      <p>Model 3: Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the assumption that features are independent of one another. The GaussianNB method from the naive_bayes module was used. This method assumes the data follows a normal distribution.</p>
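      <p>A sketch of the GaussianNB usage, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=3)
gnb = GaussianNB().fit(X, y)

# Class posteriors: one row per observation, one column per class,
# each row summing to 1.
probs = gnb.predict_proba(X)
```
      </p>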
      <p>Model 4: Gradient Boosting is a machine learning technique that combines several weak learners, typically decision trees, to form a model. The GradientBoostingClassifier method was implemented via the ensemble module. It has several tuning parameters: n_estimators - the number of boosting stages to perform, which was set to 100; learning_rate - shrinks the contribution of each tree, which was set to 1; and max_depth - the maximum depth of the individual regression estimators, which was set to 2.</p>
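      <p>A sketch using the three parameter values stated above, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=3)

# Parameter values as described in the text: 100 boosting stages,
# a learning rate of 1, and depth-2 individual estimators.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=2,
    random_state=3).fit(X, y)
train_acc = gb.score(X, y)
```
      </p>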
      <p>Model 5: K-means clustering is the most widely used unsupervised learning technique. It seeks to partition a dataset into K (specified by the user) distinct, non-overlapping clusters. It was implemented via the KMeans method from the cluster module. The n_clusters parameter - the number of clusters and centroids to generate - was set to 2 when tuning this model.</p>
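      <p>A sketch of KMeans with n_clusters=2, using two synthetic, well-separated blobs in place of the study data (n_init is made explicit here for stability; it is not mentioned in the text):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two well-separated synthetic clusters of 50 points each.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])

km = KMeans(n_clusters=2, random_state=3, n_init=10).fit(X)
labels = km.labels_
```
      </p>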
      <p>Model 6: K Nearest Neighbours (KNN) is a non-parametric method used for classification and regression analysis. KNN is sensitive to imbalanced datasets, a point to note in relation to this analysis. If the value for K is too small, the model becomes susceptible to noise; if too large, it becomes susceptible to bias. Typically, when choosing K, the square root of the number of samples in the training set is used. The KNeighborsClassifier method from the neighbors module was used to implement KNN. When tuning this model the distance parameter was set to 2 for Euclidean distance, and the value of K was determined by tuning the n_neighbors parameter, as seen in Figure 2, on one of the augmented datasets.</p>
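      <p>The square-root heuristic for K and the Euclidean distance setting can be sketched as follows on synthetic stand-in data:

```python
import math
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

# Rule-of-thumb starting point for K: the square root of the
# training-set size (160 samples here, so K = 13).
k = round(math.sqrt(len(X_train)))

# p=2 selects the Euclidean distance metric.
knn = KNeighborsClassifier(n_neighbors=k, p=2).fit(X_train, y_train)
test_acc = knn.score(X_test, y_test)
```
      </p>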
      <p>Model 7: Linear Discriminant Analysis is a classification technique that uses a linear decision boundary, created by fitting class conditional densities to a dataset and using Bayes' rule; it assumes a normal distribution. It is implemented here through the LinearDiscriminantAnalysis method from the discriminant_analysis module. When tuning this model, the solver was set to "svd" - singular value decomposition.</p>
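      <p>A sketch of the LDA configuration described above, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, random_state=3)

# solver="svd" as in the text: singular value decomposition, which
# avoids computing the covariance matrix directly.
lda = LinearDiscriminantAnalysis(solver="svd").fit(X, y)
train_acc = lda.score(X, y)
```
      </p>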
      <p>Model 8: Logistic Regression is a classification algorithm typically used in binary classification problems, such as the case here with negative (0) and positive (1) response values. In the logistic model, the log-odds (the logarithm of the odds) for the value "1" is a linear combination of one or more independent features. The LogisticRegression method from the linear_model module was used, with the solver parameter set to "liblinear".</p>
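      <p>A sketch of the logistic regression setup, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=3)

# solver="liblinear" as in the text; well suited to small binary problems.
lr = LogisticRegression(solver="liblinear", random_state=3).fit(X, y)
train_acc = lr.score(X, y)
```
      </p>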
      <p>Model 9: Random Forests are an ensemble learning method that constructs numerous decision trees during training, outputting the class that is the mode of the individual trees' classes for classification. Random Forests correct for a decision tree's habit of overfitting to its training set. The RandomForestClassifier method from the ensemble module was used for this analysis. The parameters tuned to optimise this model were max_features, which is the maximum number of variables the forest can test in each node, and n_estimators, which is the number of trees that are built before the average is taken.</p>
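      <p>The text does not name the tuning procedure, so a grid search is assumed here as one plausible way to tune the two named parameters; the candidate values and the synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=3)

# Candidate values are illustrative; the two tuned parameters
# (max_features, n_estimators) match those named in the text.
grid = {"n_estimators": [50, 100], "max_features": ["sqrt", None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=3), grid, cv=3).fit(X, y)
best = search.best_params_
```
      </p>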
      <p>Model 10: Support Vector Machines (SVM) find a boundary, known as a hyperplane, in an N-dimensional space that classifies the data points into discrete categories depending on which side of the boundary they lie. Here the SVC method was imported through the svm module. SVC is a form of SVM for dealing with classification analyses.</p>
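      <p>A sketch of the SVC usage, on synthetic stand-in data (the kernel is left at its default, which the text does not specify):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=3)

# SVC with its default RBF kernel; random_state fixed for reproducibility.
svc = SVC(random_state=3).fit(X, y)
train_acc = svc.score(X, y)
```
      </p>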
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Many classification algorithms aim to minimise the error rate and obtain a higher
accuracy result. They assume that the cost of all misclassification errors is equal.
This approach can be problematic, particularly in relation to the area of health.</p>
      <p>
        If a positive result indicates the presence of cancer, and a negative result indicates its absence, then the consequences of classifying a patient as negative when they are in fact positive (a False Negative) are more severe than classifying the patient as positive when they are in fact negative (a False Positive) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>A more accurate metric to use is sensitivity, also known as the True Positive (TP) rate. This is the proportion of people that tested positive and actually are positive. It can be considered the probability that the test is positive, given that the patient is ill. With higher sensitivity, fewer actual cases of disease go undetected; in the case of the cancer models, fewer patients that have cancer go undetected. Specificity, the True Negative (TN) rate, is the opposite of this.</p>
      <p>The Scikit-learn metrics module provides the functionality to produce a classification report, which includes values such as Precision, Recall and F1-score, as well as a confusion matrix, via the accuracy_score, classification_report and confusion_matrix methods. A description of these metrics can be seen in Table 2.</p>
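      <p>The three metric methods can be illustrated with a toy prediction vector (the labels below are invented for demonstration, not results from the study):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels: 6 observations, 4 predicted correctly.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)            # fraction correct
cm = confusion_matrix(y_true, y_pred)           # rows: true; cols: predicted
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1
```
      </p>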
      <p>Table 3 denotes the accuracy, precision, recall, and F1 results of the original cleaned dataset and the 6 resampled datasets, comprising 2 each of over-sampled, under-sampled, and combination-sampled datasets. The legend for the model abbreviations is given alongside the table.</p>
      <p>When taking accuracy as a metric, Table 3 shows that the Naive Bayes model was consistently a poor performer across the 7 datasets, scoring results as low as 9.88% and 12.41% in the original and NCR undersampled datasets respectively. In comparison, both Decision Tree models scored above 90% on all datasets except the NCR undersampled dataset. The Random Forest model scored the highest, achieving above 90% for each dataset.</p>
      <p>When viewing the original cleaned dataset (OC), it can be seen that several models failed to predict any of the positive cases correctly. The LDA model had an accuracy of 94.19% and correctly predicted 9 of the 11 positive cases, yielding a recall of 82%. The Random Forest model also had an accuracy of 94.19%; however, it only had a recall of 55%, predicting 6 of the 11 positive observations.</p>
      <p>The Random Over Sampled dataset (ROS) shows that the 3 tree models all
produced an accuracy result greater than 98%, with all 3 having a recall of 100%
for the positive diagnosis observations.</p>
      <p>When viewing the Adaptive Synthetic Sampling Over-Sampled dataset (ASS),
it can again be seen that the 3 tree models perform well with an accuracy greater
than 98%. They also produce a precision and recall result of 99% for both
positive and negative outcomes. The Random Under Sampled dataset (RUS) shows
that the Gini Decision Tree model as well as the Linear Discriminant Analysis
model perform very well, with an accuracy of 95.45% and both precision and
recall for positive and negative observations above 90% in both models.</p>
      <p>When viewing the Neighbourhood Cleaning Rule dataset (NCR), it can be
seen that 8 of the models produce an accuracy of above 90%, however from these
8 models only 2 (LDA &amp; LR) produce a positive recall value greater than 70%.
This again highlights the caution needed when using accuracy as a metric with
imbalanced data.</p>
      <p>The SMOTE-Tomek combination sampled dataset (S-TOM) produces the
model with the most promising results in this analysis. The Random Forest
model generates an accuracy of 99.69% with both positive and negative precision
and recall values almost being 100%, and an F1 result of 1 for both positive and
negative outcomes. Here the KNN model also does well when compared to its
performance in the other datasets.</p>
      <p>When viewing the Smote ENN combination sampled dataset (S-ENN), it
can be seen that again the three tree methods perform well with high recall
and precision results for both positive and negative outcomes. In 5 of the 7
datasets, the Naive Bayes model assigns the majority of observations to the
positive category, resulting in its poor overall performance, but high positive
recall results.</p>
      <p>Accuracy (%)
Dataset  DT-E   DT-G   GNB    GB     KM     KNN    LDA    LR     RF     SVC
OC       93.02  91.28   9.88  93.6   42.44  93.6   94.19  93.6   94.19  93.6
ROS      98.14  98.45  53.42  90.99  50.62  95.65  91.93  91.93  98.76  81.99
ASS      98.77  98.46  50.93  85.19  55.25  93.21  95.06  95.99  99.38  89.51
RUS      81.82  95.45  63.64  90.91  45.45  72.73  95.45  86.36  90.91  50
NCR      90.34  90.34  12.41  92.41  60     91.72  93.1   94.48  93.1   92.41
S-TOM    97.81  98.75  55.94  87.5   54.69  95.62  69.25  94.69  99.69  89.69
S-ENN    95.45  96.85  76.57  81.82  46.85  97.2   93.01  92.32  98.6   83.92
Model legend: DT-E = Decision Tree (Entropy); DT-G = Decision Tree (Gini); GNB = Gaussian Naive Bayes; GB = Gradient Boosting; KM = K-Means; KNN = K-Nearest Neighbour; LDA = Linear Discriminant Analysis; LR = Logistic Regression; RF = Random Forest; SVC = Support Vector Classifier.</p>
      <p>[Per-class (negative/positive) Precision, Recall and F1-Score values for each model and dataset accompanied the accuracy figures in Table 3; the two class columns are interleaved beyond reliable recovery in this copy and are omitted here.]</p>
      <p>Table 3. Results denoting the accuracy, precision, recall, and F1 of the models tested on the original and six resampled datasets.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper presents a comparison of classification techniques used for predicting the outcome of biopsy results based on known risk factors and screening tests. It also highlights the relevance and prior study of these known risk factors in this classification process.</p>
      <p>Pre-processing techniques were employed to address missing data and imbalance, and where applicable parameter tuning was employed to find optimal values for the models. It was shown that imbalanced data can influence the outcome of predictive models, highlighting the need for pre-processing techniques to address said issue. It was also shown that accuracy is not an acceptable measure for imbalanced data, in particular health data.</p>
      <p>From the models tested, the Random Forest model was shown to be superior
at predicting the biopsy response, yielding high accuracy, precision and recall
values, while the Gauissian Nave Bayes model was the poorest predictor. The
combination resampling method SMOTE-Tomek's dataset, in conjunction with
a Random Forest model produced the highest result with an accuracy of 99.69%,
and a precision and recall of 99% for both negative and positive targets.
13. HSE: CervicalCheck. https://www.hse.ie/eng/cervicalcheck (2019), [Online;
accessed 2019-10-11]
14. HSE: CervicalCheck: Screening information. https://www.hse.ie/eng/cervicalcheck/screening-information/why-you-are-offered-a-free-cervical-screening-test/cervical-cancer.html
(2019), [Online; accessed 2019-10-11]
15. Irish Cancer Society: Irish Cancer Society annual report 2017. https://www.cancer.ie/about-us/who-we-are/annual-reports-accounts#sthash.8McZayy5.dpbs
(2019), [Online; accessed 2019-10-11]
16. Fernandes, K., Cardoso, J.S., Fernandes, J.: Transfer learning with partial
observability applied to cervical cancer screening. https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29,
[accessed 2019-10-11]
17. Koh, W.J., et al: Cervical cancer, version 2.2015. Journal of the National
Comprehensive Cancer Network: JNCCN 13, 395&#8211;404 (2015)
18. Lema&#238;tre, G., Nogueira, F., Aridas, C.: Imbalanced-learn: A Python toolbox to
tackle the curse of imbalanced datasets in machine learning 18 (2016)
19. Manderson, L., Markovic, M., Quinn, M.: Like roulette: Australian women's
explanations of gynecological cancers. Social Science &amp; Medicine (1982) (2005)
20. NCRI: Cervical cancer trends. https://www.ncri.ie/sites/ncri/files/pubs/CervicalCaTrendsReport_35.pdf
(2019), [Online; accessed 2019-10-11]
21. Parthenis, C., Panagopoulos, P., et al: The association between sexually
transmitted infections, human papillomavirus and cervical cytology abnormalities among
women in Greece. International Journal of Infectious Diseases 73 (2018)
22. Pedregosa, F., Varoquaux, G., et al: Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research 12 (2012)
23. Rousset-Jablonski, C., Reynaud, Q., Nove-Josserand, R., Durupt, S., Durieu, I.:
Gynecological management and follow-up in women with cystic fibrosis. Revue des
maladies respiratoires 35(6), 592&#8211;603 (2018)
24. Santelli, J., Brener, N., Lowry, R., Bhatt, A., Zabin, L.: Multiple sexual partners
among U.S. adolescents and young adults. Perspectives on Sexual and Reproductive
Health 30(6), 271&#8211;275 (1998)
25. Sesti, F., Ticconi, C., Santis, L.D., Piccione, E.: Clinical value of Schiller's test in
colposcopic examination of the uterine cervix. Journal of Obstetrics and
Gynaecology 10(6), 545&#8211;547 (1990)
26. Shukla, A., Jamwal, R.: Adverse effect of combined oral contraceptive pills. Asian
Journal of Pharmaceutical and Clinical Research 10, 17&#8211;21 (2017)
27. Teame, H., et al: Factors associated with cervical precancerous lesions among
women screened for cervical cancer in Addis Ababa, Ethiopia (2018)
28. Walsh, J., O'Reilly, M., Treacy, F.: Factors affecting attendance for a cervical smear
test: A prospective study. Irish Cervical Screening Programme and the National
University of Ireland, Galway
29. Wirth, R., Hipp, J.: CRISP-DM: Towards a standard process model for data mining.
Journal of Machine Learning Research (2000)
30. World Health Organisation: HPV and cervical cancer. https://www.who.int/en/news-room/fact-sheets/detail/human-papillomavirus-(hpv)-and-cervical-cancer
(2019), [Online; accessed 2019-10-11]
31. Wu, W., Zhou, H.: Data-driven diagnosis of cervical cancer with support vector
machine-based approaches. IEEE Access 5, 25189&#8211;25195 (2017)
32. Xu, H., et al: Hormonal contraceptive use and smoking as risk factors for
high-grade cervical intraepithelial neoplasia in unvaccinated women aged 30&#8211;44 years:
A case-control study in New South Wales, Australia. Cancer Epidemiology (2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alwesabi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Classification of cervical cancer dataset (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Averbach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al:
          <article-title>Recent intrauterine device use and the risk of precancerous cervical lesions and cervical cancer</article-title>
          .
          <source>Contraception</source>
          <volume>98</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Is there a reproducibility crisis?</article-title>
          <source>Journal of Machine Learning Research</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>F.X.</given-names>
          </string-name>
          , et al:
          <article-title>Risk factors for cervical cancer in Colombia and Spain</article-title>
          .
          <source>International Journal of Cancer</source>
          <volume>52</volume>
          (
          <issue>5</issue>
          ),
          <fpage>750</fpage>
          &#8211;8 (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dillner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Long term predictive values of cytology and human papillomavirus testing in cervical cancer screening: joint European cohort study</article-title>
          .
          <source>BMJ</source>
          <volume>337</volume>
          (
          <year>2008</year>
          ), https://www.bmj.com/content/337/bmj.a1754
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Eldridge</surname>, <given-names>R.C.</given-names></string-name>,
          <string-name><surname>Pawlita</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Wilson</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Castle</surname>, <given-names>P.E.</given-names></string-name>,
          <string-name><surname>Waterboer</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Gravitt</surname>, <given-names>P.E.</given-names></string-name>,
          <string-name><surname>Schiffman</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Wentzensen</surname>, <given-names>N.</given-names></string-name>:
          <article-title>Smoking and subsequent human papillomavirus infection: a mediation analysis</article-title>
          .
          <source>Annals of Epidemiology</source>
          <volume>27</volume>
          (
          <issue>11</issue>
          ),
          <fpage>724</fpage>
          &#8211;
          <lpage>730</lpage>
          .e1 (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Eraso</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Migrating techniques, multiplying diagnoses: the contribution of Argentina and Brazil to early 'detection policy' in cervical cancer</article-title>
          <volume>17</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Farhangfar</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kurgan</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Dy</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Impact of imputation of missing values on classification error for discrete data</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>41</volume>
          (
          <issue>12</issue>
          ),
          <fpage>3692</fpage>
          &#8211;
          <lpage>3705</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardoso</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandes</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Transfer learning with partial observability applied to cervical cancer screening</article-title>
          .
          <source>In: IbPRIA</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ganganwar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>An overview of classification algorithms for imbalanced datasets</article-title>
          .
          <source>International Journal of Emerging Technology and Advanced Engineering</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Greenlee</surname>, <given-names>R.T.</given-names></string-name>,
          <string-name><surname>Murray</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Bolden</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Wingo</surname>, <given-names>P.A.</given-names></string-name>:
          <article-title>Cancer statistics, 2000</article-title>
          .
          <source>CA: A Cancer Journal for Clinicians</source>
          <volume>50</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          &#8211;
          <lpage>33</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. HSE: CervicalCheck. http://www.cervicalcheck.ie/news-and-events/information-for-healthcare-professionals-from-cervicalcheck-latest-update.14910.html
          (
          <year>2019</year>
          ), [Online; accessed 2019-10-11]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>