Challenges for Automation of Public Health Data Analysis

Ravi Shankar
Grenoble, France
rsps1001@gmail.com

Abstract

Advancements in Machine Learning and Data Science are not adequately reflected in how public health data is handled today. There is a visible gap between the advances in computing and medical sciences. In this position paper, we present an example of data science applied to the automation of a repetitive process within a cervical cancer screening program. We discuss the challenges for automating public health data and share our insights to elevate artificial intelligence (AI) in public healthcare.

Keywords

Public health, Data analysis, Cancer research, Automation

Joint Proceedings of the ACM IUI Workshops 2022, March 2022, Helsinki, Finland
Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

More than 80% of the cervical cancer cases and deaths in a year occur in low- and middle-income countries (LMICs), where prevention and cervical screening resources are limited [1][2]. Recent research studies have used machine learning models to support the initial phase of screening for detection of cancerous lesions using colposcopic images or cervicography [3][4]. These techniques require tech-savvy healthcare workers, who are very scarce per capita in these countries. We aim to build a user-friendly automation that would allow medical experts to diagnose cancerous tissues of the cervix in a short period of time while reducing the costs and technical experience required. This idea will work by combining health and AI researchers' expertise and experiences.

The main problem we aim to address is diagnosing biopsied women within a cervical cancer program. Our motivation is driven by the importance and time consumption of the pathology process (i.e., pathologists reading histological slides). In the pathology process, women testing positive on screening tests are referred to specialised examination (colposcopy) to collect biopsy samples from the cervix, and then haematoxylin and eosin (H&E) histological slides are prepared to be reviewed by pathologists using a microscope. This is an eye-dependent process; therefore inter- and intra-observer variability is present, and external revision is often needed as part of quality assurance (QA). The full process including QA may not be affordable, particularly in LMICs. Hence, AI can contribute to eliminating such variability while saving time and resources.

Figure 1: Project pipeline illustrating the automation of public health data analysis involving human reviewers who validate the Machine Learning model's prediction results.

Figure 1 illustrates the example of a proposed pipeline in which we aim to automate the steps, starting from the fetching of biopsy-based cervical data within a cervical cancer screening program. We then pre-process the fetched data, followed by training our machine learning (ML) model to make two or three prediction sets (ensuring QA) for human reviewers to validate, and finally generate the reports of the analysis. The current process (highlighted in yellow in Figure 1) excludes these steps for automation (highlighted in blue in Figure 1).

While the system works with the current process, the automation steps are currently done manually and repeatedly by a group of pathologists and statisticians. As the ML model does not exist in the current process, the analysis reports are produced after 2-3 stages of reviews involving multiple meetings to concur on the results. Including our proposed steps for automation in the current process will lower the burden of the experts and improve the timeframe up to 1/20 in comparison to the current process.

While our proposed project pipeline (Figure 1) forecasts optimal benefits for cervical cancer screening, in laying the groundwork we were faced with critical technical, ethical, legal, and (most importantly) end-user-facing challenges. In this Workshop on "Healthy Interfaces (HEALTHI) 2022," we look forward to discussing our research on automation of public health data analysis. We hope to share our current challenges, methods, and future plans for AI-powered healthcare.

2. Challenges for Automation of Public Health Data Analysis

In this section we generalise the problems we faced when implementing our project (Figure 1) to discuss the challenges for automation of public health data analysis:

1. Trained Data Entry: The first challenge is the considerable effort needed to change the conventional data entry practices. To automate, it is essential to construct database constraints, design helpful interfaces, and train non-tech-savvy workers to log complete, error-free, and correctly formatted data.
2. Patient Privacy: Anonymising the data is important for preserving the privacy of the personal health records of patients who sign up for the study. If possible, it should be mindfully made visible at the level of the interface to both the patients and their clinicians.
3. Data Pre-processing: Data pre-processing is the cleaning and preparation of data for the model and analysis tasks. This is a time-consuming and underestimated challenge; if done improperly, it can hinder the performance and accuracy of the model and delay the overall study.
4. Handling Large Datasets: As public health studies are ongoing processes which include participants on a rolling basis, they can result in large datasets over the overall period of the study (which might span several years). It is crucial to prepare for handling the data in batches for faster training of the model.
5. Cross Validation by Experts: Validation of results is a necessity with respect to training ML models. With healthcare-related data, cross validation by experts is all the more important, both to prevent fatal diagnostic errors and to check for any potential biases in the model.
6. Human Control: It is important to have adequate human control so that the confidence in the predicted results is higher. Enabling human control via the automation process's interface makes it possible to spot any discrepancies and malfunctions.
7. Transparency: The interface should be made simple and transparent for both non-medical and other non-tech-savvy stakeholders involved. The entire automation process should be comprehensible to all stakeholders involved for the project to succeed.
8. Legal Efforts and Approval: The last but most important challenge is to succeed in the legal efforts and approvals required for the automation projects. Developing proofs of concept with publicly available datasets is one of the ways to prepare for the challenge of gaining legal approvals and other grants.

3. Conclusion

AI-powered public healthcare will foster a health structure in the future where the AI process drives the speed and accuracy of the diagnosis, treatment, and recovery. People will get the right diagnosis at the right time such that their treatment and recovery chances improve, thus improving their chances of a good life. Furthermore, the cost efficiency brought by AI techniques will enable smart healthcare to be adapted to different healthcare structures in different countries, specifically in low-income countries, so that healthcare becomes accessible and affordable there. This is a possibility only when AI researchers combine their expertise and experiences with health researchers. With this position paper we aim to contribute by informing both medical professionals and computer
scientists of the challenges for automation of public health data analysis.

4. References

[1] Almonte, Maribel, Raúl Murillo, Gloria Inés Sánchez, Paula González, Annabelle Ferrera, M A Picconi, Carolina Wiesner, Aurelio Cruz-Valdéz, Eduardo Lazcano-Ponce, Jose Jeronimo, Catterina Ferreccio, Elena Kasamatsu, Laura Patricia Mendoza, Guillermo Rodríguez, Alejandro Calderón, Gino Venegas, Verónica Villagra, Silvio Alejandro Tatti, Laura Fleider, Carolina Terán, Armando Baena, María de la Luz Hernández, Mary-Luz Rol, Eric Lucas, Sylvaine Barbier, Arianis Tatiana Ramírez, Silvina Arrossi, Maria I. Rodriguez, E Díaz González, Marcela Celis, Sandra Martínez, Yuly Salgado, Marina Ortega, Andrea Verónica Beracochea, Natalia Pérez, Margarita M Rodríguez de la Peña, Maria de Sales Ramon, Pilar Hernández-Nevarez, Margarita Arboleda-Naranjo, Yessy Cabrera, Brenda Utrera Salgado, Laura García, Marco Antonio Retana, María Celeste Colucci, Javier A. Arias-Stella, Yenny Bellido-Fuentes, María Liz Bobadilla, Gladys Olmedo, Ivone Brito-García, Armando Méndez-Herrera, Lucía Cardinal, Betsy Flores, J F Márquez Peñaranda, Josefina Martínez-Better, Ana María Soilán, Jacqueline Figueroa, Benedicta Caserta, Carlos P. Sosa, Adrian A. Moreno, Juan Mural, Franco Doimi, Diana Giménez, Hernando Gutiérrez Rodríguez, Oscar Lora, Silvana Luciani, Nathalie Jeanne Nicole Broutet, Teresa M. Darragh and Rolando Herrero. “Multicentric study of cervical cancer screening with human papillomavirus testing and assessment of triage methods in Latin America: the ESTAMPA screening study protocol.” BMJ Open 2020 May 24;10(5): e035796. doi: 10.1136/bmjopen-2019-035796. PMID: 32448795; PMCID: PMC7252979.

[2] Bray F, Jemal A, Grey N, Ferlay J, Forman D. Global cancer transitions according to the Human Development Index (2008-2030): a population-based study. Lancet Oncol. 2012;13(8):790–801.

[3] Cho, BJ., Choi, Y.J., Lee, MJ. et al. Classification of cervical neoplasms on colposcopic photography using deep learning. Sci Rep 10, 13652 (2020). https://doi.org/10.1038/s41598-020-70490-4

[4] Liming Hu, David Bell, Sameer Antani, Zhiyun Xue, Kai Yu, Matthew P Horning, Noni Gachuhi, Benjamin Wilson, Mayoore S Jaiswal, Brian Befano, L Rodney Long, Rolando Herrero, Mark H Einstein, Robert D Burk, Maria Demarco, Julia C Gage, Ana Cecilia Rodriguez, Nicolas Wentzensen, Mark Schiffman, An Observational Study of Deep Learning and Automated Evaluation of Cervical Images for Cancer Screening, JNCI: Journal of the National Cancer Institute, Volume 111, Issue 9, September 2019, Pages 923–932, https://doi.org/10.1093/jnci/djy225
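
As a supplementary illustration of challenges 2 (Patient Privacy), 4 (Handling Large Datasets), and 5 (Cross Validation by Experts) in Section 2, the following Python sketch shows one way these steps might be combined. It is not the implementation used in the project. The record layout (`patient_id`, `predictions`), the label values, the key `SECRET_KEY`, and all function names are assumptions made for illustration. Note also that a keyed hash only pseudonymises an identifier, which is weaker than full anonymisation, and that flagging only disagreeing cases is our simplification: the pipeline in Figure 1 has human reviewers validate the model's predictions in general.

```python
import hashlib
import hmac
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical secret held by the study's data custodian (assumption);
# it must be stored separately from the data it protects.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymise(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace a patient identifier with a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def batches(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def needs_review(prediction_sets: list) -> bool:
    """Flag a case when its prediction sets do not all agree."""
    return len(set(prediction_sets)) > 1


def flag_for_review(records: Iterable[dict], batch_size: int = 2) -> list:
    """Process records batch by batch and collect pseudonymised IDs
    of the cases whose prediction sets disagree."""
    flagged = []
    for batch in batches(records, batch_size):
        for record in batch:
            if needs_review(record["predictions"]):
                flagged.append(pseudonymise(record["patient_id"]))
    return flagged


# Made-up example records; the labels are placeholders, not study data.
sample_records = [
    {"patient_id": "P001", "predictions": ["positive", "positive", "positive"]},
    {"patient_id": "P002", "predictions": ["positive", "negative", "positive"]},
    {"patient_id": "P003", "predictions": ["negative", "negative"]},
]
cases_to_review = flag_for_review(sample_records)
```

In this sketch only the second record would be passed to the reviewers, and it would reach them under its pseudonymised identifier rather than the original patient ID.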