Challenges for Automation of Public Health Data Analysis
Ravi Shankar
Grenoble, France
rsps1001@gmail.com


                                Abstract
                                Advancements in Machine Learning and Data Science are not adequately reflected in how
                                public health data is handled today. There is a visible gap between the advances in computing
                                and medical sciences. In this position paper, we present an example of data science applied to
                                the automation of a repetitive process within a cervical cancer screening program. We discuss
                                the challenges for automating public health data and share our insights to elevate artificial
                                intelligence (AI) in public healthcare.
                                Keywords
                                Public health, Data analysis, Cancer research, Automation


1. Introduction

    More than 80% of cervical cancer cases and deaths each year occur in low- and
middle-income countries (LMICs), where prevention and cervical screening resources are
limited [1][2]. Recent research studies have used machine learning models to support the
initial phase of screening by detecting cancerous lesions in colposcopic images or
cervicography [3][4]. These techniques require tech-savvy healthcare workers, who are very
scarce per capita in these countries.
    We aim to build a user-friendly automation that would allow medical experts to diagnose
cancerous tissues of the cervix in a short period of time while reducing the costs and
technical experience required. This idea relies on combining the expertise and experience
of health and AI researchers.
    The main problem we aim to address is the diagnosis of biopsied women within a cervical
cancer program. Our motivation is driven by the importance of the pathology process (i.e.,
pathologists reading histological slides) and the time it consumes. In the pathology
process, women testing positive on screening tests are referred to specialised examination
(colposcopy) to collect biopsy samples from the cervix, and haematoxylin and eosin (H&E)
histological slides are then prepared to be reviewed by pathologists using a microscope.
This process depends on visual judgement, so inter- and intra-observer variability is
present, and external revision is often needed as part of quality assurance (QA). The full
process including QA may not be affordable, particularly in LMICs. Hence, AI can contribute
to eliminating such variability while saving time and resources.

Figure 1: Project pipeline illustrating the automation of public health data analysis
involving human reviewers who validate the Machine Learning model’s prediction results.

    Figure 1 illustrates an example of a proposed pipeline in which we aim to automate the
steps that start with fetching biopsy-based cervical data within a cervical cancer
screening program. We then pre-process the fetched data, train our machine learning (ML)
model to make two or three prediction sets (ensuring QA) for human reviewers to validate,
and finally generate the reports of the analysis. The current

Joint Proceedings of the ACM IUI Workshops 2022, March 2022,
Helsinki, Finland EMAIL: rsps1001@gmail.com (A. 1)
                            Copyright © 2022 for this paper by its authors. Use permitted under Creative
                            Commons License Attribution 4.0 International (CC BY 4.0).
                            CEUR Workshop Proceedings (CEUR-WS.org)
process (highlighted in yellow in Figure 1) excludes these steps for automation
(highlighted in blue in Figure 1).
    While the system works with the current process, the steps we propose to automate are
currently carried out manually and repeatedly by a group of pathologists and statisticians.
As the ML model does not exist in the current process, the analysis reports are produced
after 2-3 stages of reviews involving multiple meetings to concur on the results. Including
our proposed steps for automation in the current process will lower the burden on the
experts and improve the timeframe up to 1/20 in comparison to the current process.
    While our proposed project pipeline (Figure 1) forecasts optimal benefits for cervical
cancer screening, in laying the groundwork we faced critical challenges spanning technical,
ethical, legal, and (most importantly) end-user-facing concerns. At the Workshop on
“Healthy Interfaces (HEALTHI) 2022,” we look forward to discussing our research on
automation of public health data analysis. We hope to share our current challenges,
methods, and future plans for AI-powered healthcare.

2. Challenges for Automation of Public Health Data Analysis

    In this section we generalise the problems we faced when implementing our project
(Figure 1) to discuss the challenges for automation of public health data analysis:
    1. Trained Data Entry: The first challenge is the considerable effort needed to change
    conventional data entry practices. To automate, it is essential to construct database
    constraints, design helpful interfaces, and train non-tech-savvy workers to log
    complete, error-free, and correctly formatted data.
    2. Patient Privacy: Anonymising the data is important for preserving the privacy of the
    personal health records of patients who sign up for the study. Where possible, the
    anonymisation should be made visible, with care, at the level of the interface to both
    the patients and their clinicians.
    3. Data Pre-processing: Data pre-processing is the cleaning and preparation of data for
    the model and analysis tasks. This is a time-consuming and underestimated challenge; if
    done improperly, it can hinder the performance and accuracy of the model and delay the
    overall study.
    4. Handling Large Datasets: As public health studies are ongoing processes which include
    participants on a rolling basis, they can result in large datasets over the overall
    period of the study (which might span several years). It is crucial to prepare for
    handling the data in batches for faster training of the model.
    5. Cross Validation by Experts: Validation of results is a necessity when training ML
    models. For healthcare-related data, cross validation by experts is even more important,
    both to prevent fatal diagnosis errors and to check for any potential biases in the
    model.
    6. Human Control: It is important to have adequate human control so that confidence in
    the predicted results is higher. Enabling human control via the automation process’s
    interface allows users to spot any discrepancies and malfunctions.
    7. Transparency: The interface should be made simple and transparent for both
    non-medical and non-tech-savvy stakeholders. The entire automation process should be
    comprehensible to all stakeholders involved for the project to succeed.
    8. Legal Efforts and Approval: The last but most important challenge is to succeed in
    the legal efforts and approvals required for automation projects. Developing proofs of
    concept with publicly available datasets is one way to prepare for the challenge of
    gaining legal approvals and other grants.

3. Conclusion

    AI-powered public healthcare will foster a future health structure in which the AI
process drives the speed and accuracy of diagnosis, treatment, and recovery. People will get
the right diagnosis at the right time, so that their chances of treatment and recovery
improve, thus improving their chances of a good life. Furthermore, the cost efficiency
brought by AI techniques will enable smart healthcare to be adapted to the different
healthcare structures of different countries, specifically low-income countries, so that
healthcare becomes accessible and affordable there. This is possible only when AI
researchers combine their expertise and experience with those of health researchers. With
this position paper we aim to contribute by informing both medical professionals and
computer scientists of the challenges for automation of public health data analysis.
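
    As a minimal sketch of how challenges 2, 4, and 5 from Section 2 might look in code, the
Python fragment below pseudonymises patient identifiers with a keyed hash, walks the records
in batches, and flags the cases whose prediction sets disagree so that human reviewers can
be pointed to them. It is illustrative only and is not the implementation behind Figure 1:
the record layout, the key, the names pseudonymise, batches, and needs_review, and the rule
of flagging on disagreement are all assumptions made for this example. Note also that a
keyed hash provides pseudonymisation rather than full anonymisation.

```python
import hashlib
import hmac
from itertools import islice
from typing import Iterable, Iterator, List

# Hypothetical secret key; a real study would store and manage this securely.
STUDY_KEY = b"example-study-key"


def pseudonymise(patient_id: str, key: bytes) -> str:
    """Map a patient identifier to a short keyed SHA-256 digest (assumed scheme)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def needs_review(prediction_sets: List[str]) -> bool:
    """Return True when the prediction sets for one case do not all agree."""
    return len(set(prediction_sets)) > 1


# Hypothetical records: one identifier plus the labels from three prediction sets.
records = [
    {"patient_id": "P001", "predictions": ["negative", "negative", "negative"]},
    {"patient_id": "P002", "predictions": ["positive", "negative", "positive"]},
    {"patient_id": "P003", "predictions": ["positive", "positive", "positive"]},
]

flagged = []  # pseudonymised identifiers of cases with disagreeing predictions
for batch in batches(records, size=2):
    for record in batch:
        pseudonym = pseudonymise(record["patient_id"], STUDY_KEY)
        if needs_review(record["predictions"]):
            flagged.append(pseudonym)

print(f"{len(flagged)} of {len(records)} cases flagged for expert review")
```

    Processing in fixed-size batches keeps memory use bounded as participants are added on a
rolling basis, and only pseudonyms, never raw identifiers, leave the loop.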
4. References

[1] Almonte, Maribel, Raúl Murillo, Gloria Inés Sánchez, Paula González, Annabelle Ferrera,
    M A Picconi, Carolina Wiesner, Aurelio Cruz-Valdéz, Eduardo Lazcano-Ponce, Jose
    Jeronimo, Catterina Ferreccio, Elena Kasamatsu, Laura Patricia Mendoza, Guillermo
    Rodríguez, Alejandro Calderón, Gino Venegas, Verónica Villagra, Silvio Alejandro Tatti,
    Laura Fleider, Carolina Terán, Armando Baena, María de la Luz Hernández, Mary-Luz Rol,
    Eric Lucas, Sylvaine Barbier, Arianis Tatiana Ramírez, Silvina Arrossi, Maria I.
    Rodriguez, E Díaz González, Marcela Celis, Sandra Martínez, Yuly Salgado, Marina Ortega,
    Andrea Verónica Beracochea, Natalia Pérez, Margarita M Rodríguez de la Peña, Maria de
    Sales Ramon, Pilar Hernández-Nevarez, Margarita Arboleda-Naranjo, Yessy Cabrera, Brenda
    Utrera Salgado, Laura García, Marco Antonio Retana, María Celeste Colucci, Javier A.
    Arias-Stella, Yenny Bellido-Fuentes, María Liz Bobadilla, Gladys Olmedo, Ivone
    Brito-García, Armando Méndez-Herrera, Lucía Cardinal, Betsy Flores, J F Márquez
    Peñaranda, Josefina Martínez-Better, Ana María Soilán, Jacqueline Figueroa, Benedicta
    Caserta, Carlos P. Sosa, Adrian A. Moreno, Juan Mural, Franco Doimi, Diana Giménez,
    Hernando Gutiérrez Rodríguez, Oscar Lora, Silvana Luciani, Nathalie Jeanne Nicole
    Broutet, Teresa M. Darragh and Rolando Herrero. “Multicentric study of cervical cancer
    screening with human papillomavirus testing and assessment of triage methods in Latin
    America: the ESTAMPA screening study protocol.” BMJ Open 2020 May 24;10(5): e035796.
    doi: 10.1136/bmjopen-2019-035796. PMID: 32448795; PMCID: PMC7252979.
[2] Bray F, Jemal A, Grey N, Ferlay J, Forman D. Global cancer transitions according to the
    Human Development Index (2008-2030): a population-based study. Lancet Oncol.
    2012;13(8):790–801.
[3] Cho, BJ., Choi, Y.J., Lee, MJ. et al. Classification of cervical neoplasms on
    colposcopic photography using deep learning. Sci Rep 10, 13652 (2020).
    https://doi.org/10.1038/s41598-020-70490-4
[4] Liming Hu, David Bell, Sameer Antani, Zhiyun Xue, Kai Yu, Matthew P Horning, Noni
    Gachuhi, Benjamin Wilson, Mayoore S Jaiswal, Brian Befano, L Rodney Long, Rolando
    Herrero, Mark H Einstein, Robert D Burk, Maria Demarco, Julia C Gage, Ana Cecilia
    Rodriguez, Nicolas Wentzensen, Mark Schiffman, An Observational Study of Deep Learning
    and Automated Evaluation of Cervical Images for Cancer Screening, JNCI: Journal of the
    National Cancer Institute, Volume 111, Issue 9, September 2019, Pages 923–932,
    https://doi.org/10.1093/jnci/djy225