Exploration of known and unknown early symptoms of cervical cancer and development of a symptom spectrum - Outline of a data and text mining based approach

Claudia Ehrentraut¹, Karin Sundström², Hercules Dalianis¹

¹ Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
² Department of Medical Epidemiology and Biostatistics (MEB), Karolinska Institutet, Stockholm, Sweden
ehrentraut@dsv.su.se, hercules@dsv.su.se, karin.sundstrom@ki.se

Abstract. This position paper delineates the structure of some experiments to detect early symptoms of cervical cancer. We are using a large corpus of electronic patient record texts in Swedish from Karolinska University Hospital from the years 2009-2010, from which we extracted in total 1,660 patient records with the ICD-10 diagnosis code C53 for cervical cancer. We used a Named Entity Recogniser called Clinical Entity Finder to detect the diagnoses and symptoms expressed in these clinical texts, which contain in total 2,988,118 words. We found 28,218 symptoms and diagnoses for these 1,660 patients. We present and discuss some initial findings and propose a set of experiments to find possible early symptoms and/or a spectrum of early symptoms of cervical cancer.

1 Introduction and Motivation

In the last ten years patient records have become, at least in Sweden, completely digitalized and also centralised in large repositories. This is a vast source of knowledge for medical research; however, this resource has not been much exploited. The reasons are that clinical researchers have little or no knowledge of data and text mining, and that these repositories, due to their sensitive nature, are difficult to access for research.

Lately, these sources have become available, to a very small extent, to researchers in the U.S. as well as Europe. Meystre et al. [9] wrote a review article about different text mining approaches and tools, mostly for English textual data.
Among others, they mention an approach to detect early symptoms of breast cancer. Dalianis [3] describes clinical text mining, including extraction and retrieval, specifically for use in Swedish patient records. It is timely to assess to what extent text mining can assist in the evaluation of symptoms encountered in the course of human cancer.

It has been shown in an interview-based study that young females with cervical cancer frequently delay presentation, and that not recognising symptoms as serious may increase the risk of delay [8]. Improved identification and awareness of early signs of cervical cancer may reduce both patient and provider delay of investigation and treatment. Thus, it could be highly valuable to establish the cervical cancer symptom spectrum and whether there are additional symptoms that should be added to it. As a novel tool, text mining could aid with a bias-free search for words and biochemical features that may not have been previously suspected or identified by patients or health care staff.

Proceeding from CAiSE 2015 Industry Track Copyright © 2015 held by the author(s) Krogstie, Juel-Skielse, Kabilan (Eds.)

The hypothesis in this project originates from the assumption that women with early cervical cancers and pre-cancers usually have no symptoms [14, 1]. So far, symptoms of a disease are mainly collected by capturing descriptions made by the patient, either spontaneously or after being questioned by a health care professional. However, few of these are relevant for registration in national health registers. Thus, traditional register-based research cannot access such data.

The project has two major aims:

1. Determine whether there are unknown early symptoms of cervical cancer, and if so which. This could inform health care and screening processes of symptoms in women that may be of note. The anticipated output is to find unknown early symptoms of cervical cancer.
In this regard, a list of concrete symptoms is considered to be the desired finding.
2. Develop and characterize a symptom spectrum for cervical cancer through a holistic description of symptoms as recorded in medical text by health care staff. Such a spectrum would include both previously known and potentially unknown symptoms. The anticipated output is a holistic description of cervical cancer symptoms, i.e., likelihood of occurrence, time of occurrence and frequency of occurrence of diverse symptoms. Ideally, the symptom description will be an interactive visualization, as for instance depicted in Figure 1. This serves the purpose of generating a better understanding of possible cervical cancer symptoms, given their potentially ambiguous nature.

The purpose of both aims is to obtain a clearer understanding of symptoms that occur in cervical cancer patients compared to non-cancer patients, based on evidence gained through a statistical analysis of a large amount of medical data. We intend to approach these aims by applying and enhancing state-of-the-art text mining tools.

Fig. 1. Proposed visualisation of the cervical cancer symptom spectrum. The height of a bar depicts the number of symptoms. One possibility is also to show the number of negated symptoms, or the absence of symptoms, as negative bars.

The overall goal is to use our findings as a complement in screening programs for cervical cancer. In addition to taking a screening test for cervical cancer, the physician could for example run a program to filter out the patient's symptoms, if captured in the medical record, and compare them to a list of possible early cervical cancer symptoms. Ideally, this approach should be generic in order to be applicable to other cancers.
This paper intends to outline the current state of the art within cervical cancer prevention and how text mining has hitherto been applied in the cancer domain. Further, this paper presents (1) initial experiments that have been performed and (2) an outlook on proposed work to find unknown early symptoms and develop a symptom spectrum for cervical cancer.

2 Background

Cervical cancer (ICD-10 diagnosis code: C53) is one of the most common cancers worldwide [2], frequently striking young women below age 40 if not screened [15]. A long-term infection with the human papillomavirus (HPV), which spreads via sexual contact, is deemed a necessary but not sufficient factor in the development of cervical cancer [16]. Today, women are offered screening every three to five years, with the Pap test being most commonly used, in order to detect abnormal changes in the cells at an early stage. Cancer in an intermediate or advanced stage has a high mortality. Early diagnosis is therefore crucial in order to prevent treatable pre-cancer from turning into invasive cancer [1]. Early detection is nevertheless often hindered, since not all women wish to participate in cervical screening programs.

Women who do not attend screening can be diagnosed via symptomatic presentation. However, diagnoses of cervical cancer may be delayed because of a failure to recognize symptoms as cancer-related. As Lim et al. [7] found, some reasons for the delay may be that the patients (1) do not recognize possible cancer symptoms, especially vaginal discharge, and (2) do not re-attend promptly after first presentation despite persisting symptoms. Delays in diagnosis also occur on the part of the provider, who may fail to recognize cervical cancer-related symptoms.

According to the state-of-the-art assumption, women with early cervical cancers and pre-cancers usually have no symptoms [14, 1]. Yet, it is possible that there are blood value deviations or other unforeseen symptoms.
In most cases, the symptoms do not start until the cancer has reached a more advanced stage. Usual gynecological symptoms at that point are (1) abnormal vaginal bleeding, (2) unusual discharge from the vagina and (3) pain during intercourse [7, 1].

Increasing the awareness of (early) cervical cancer symptoms among women and health care staff might improve diagnostics and chance of survival [6]. Finding hitherto unknown early symptoms which may appear during a pre-cancerous stage could further help to diagnose cervical cancer at a time when it is still treatable.

Spasic et al. [13] reviewed different approaches for clinical text mining within the cancer domain. Of all studies the authors refer to, only two have focused on cervical cancer and HPV, respectively.

The study focusing on cervical cancer aimed at finding a method for retrieving oncology documents relevant to clinical decisions within the particular domain of cervical cancer. With a content-based text classification process and similarity analysis at its core, the system obtains its highest accuracy at 92% [11].

The study focusing on HPV aimed at discriminating high-risk HPV types, i.e., those that are related to cervical cancer, from low-risk types, i.e., those that are not related to it. Comparing three machine learning algorithms, namely AdaCost, AdaBoost and Naïve Bayes, the authors showed that AdaCost outperforms the other algorithms, yielding an accuracy of circa 93% and an F-score of about 87% [10].

3 Materials and Methods

The researchers of the MINECAN¹ project and this particular study have access to the Stockholm Electronic Patient Record (SEPR) Corpus, which comprises patient records from 2006 to 2014 from Karolinska University Hospital in Stockholm, Sweden [4].
The corpus contains records from all units at Karolinska University Hospital except for records from the psychiatric and venereal disease units. For the MINECAN project, a subcorpus² is created from the SEPR Corpus.

In order to approach the main goal of finding unknown early symptoms and creating a symptom spectrum for cervical cancer, the initial work comprised the construction of part of the subcorpus and initial experiments performed on that corpus.

The approach used for this project resembles a retrospective case-control study. That means past medical records are used to identify exposure and outcome factors, e.g., potential exposures/symptoms for the outcome cervical cancer. The study comprises a group of interest (study group) and a comparison (or control) group³.

¹ MINECAN - Data and text mining of cancer symptoms and comorbidities in electronic patient records in the Nordic languages
² This research has been approved by the Regional Ethical Review Board in Stockholm (Etikprövningsnämnden i Stockholm), permission number 2014/1882-31/5
³ http://hsl.lib.umn.edu/biomed/help/understanding-research-study-designs

3.1 ICD-10 diagnosis codes

The study group data consists of records that belong to patients diagnosed with cervical cancer. These patients are identified as having cervical cancer if an appropriate ICD-10 diagnosis code is found in their records. All cervical cancer related ICD-10 codes were specified by the project's medical expert. They are:
– C53.0 (Malignant neoplasm: Endocervix)
– C53.1 (Malignant neoplasm: Exocervix)
– C53.8 (Malignant neoplasm: Overlapping lesion of cervix uteri)
– C53.9 (Malignant neoplasm: Cervix uteri, unspecified)
– D06.0 (Carcinoma in situ: Endocervix)
– D06.1 (Carcinoma in situ: Exocervix)
– D06.7 (Carcinoma in situ: Other parts of cervix)
– D06.9 (Carcinoma in situ: Cervix, unspecified)
– N87.2 (Severe cervical dysplasia, not elsewhere classified)

The SEPR Corpus is stored in a database. Ultimately, the subset that is created from this corpus for the cervical cancer project will comprise records that belong to the study group as well as the control group. As part of the first experiments, only data for the study group has been extracted. Defining and extracting data for the control group will be done at a later point in time.

For the study group, the following information is extracted from the database using MySQL queries:

– Gender and age of the patient
– Dates of the patient's admission to and discharge from hospital
– Clinic(s) where the patient is treated
– Daily notes (free text) and the corresponding dates of entry into the hospital system during the years 2009-2010

Once the data is extracted, all information about each patient is saved into a text file, with one file per patient, containing patient number, age and gender information as well as all of the patient's daily notes sorted by date. These files are then used for further processing and analysis.

3.2 Statistics of study group

Statistics for the study group were obtained according to the following parameters: age, clinic, time of diagnosis, length of treatment. In total, 1,660 patients are contained in the study group. Of these, 1,587 patients have only one ICD-10 diagnosis code, i.e., a C53, D06 or N87 code. 72 patients have two diagnosis codes in their records: 42 patients had C53 and D06 diagnosis codes in their records, while for 29 patients D06 and N87 co-occurred in the records.
No patients had C53 and N87 co-occurring in the record. For one patient, all three diagnosis codes occurred in the record. Of the 1,587 patients who only had one diagnosis, 603 had a C53 diagnosis code, 955 a D06 code and 29 an N87 code.

Fig. 2. Age statistics, ICD-10 diagnosis codes C53, D06 and N87.

The following section describes an initial approach of generating a frequency list of symptoms captured in records of patients that were assigned a C53 code⁴, on a small subset of the data.

⁴ Only using C53 codes and no D06 and N87 codes is motivated by the fact that we want to start testing

This method aims at identifying symptom words in patient records, extracting them from the records and sorting them according to their frequencies. Ultimately, this step will be done for the records of the study group and the control group, resulting in two frequency lists, a cervical cancer list and a control list. The two frequency lists will be compared to one another to see

– if and how the symptoms differ between cases and controls
– if well-known cervical cancer symptoms are identified most frequently in the cervical cancer list, or
– if there are other symptoms that occur more frequently
– whether our methods can accurately identify a priori known/suspected associations, which should validate whether the methodology is appropriate

As part of these first experiments, an initial cervical cancer frequency list was created in a two-step process:

– Identify all symptoms by using the tool Clinical Entity Finder (CEF)
– Extract, sort and count all found symptoms and save them into a frequency list

The Clinical Entity Finder, CEF, implements the task of Named Entity Recognition (NER), i.e., recognizing expressions denoting entities such as diseases, drugs, or people's names in free text documents [9].
This task can be performed automatically, and over the past years multiple NER algorithms have been implemented. NER modules for English are for instance available via the Stanford CoreNLP⁵ package or Apache OpenNLP⁶. Skeppstedt et al. [12] have implemented the Clinical Entity Finder, which can automatically recognize entities in the narrative text of Swedish health records. The tool is based on CRF++, an implementation of the conditional random fields algorithm, and was initially implemented to detect terms within the entity categories Disorder, Finding, Pharmaceutical Drug and Body Structure. After running CEF, the detected cervical cancer symptoms are sorted, counted and saved into a frequency list, which is depicted in Table 1.

⁵ http://nlp.stanford.edu/software/corenlp.shtml. 2014-09-08.
⁶ https://opennlp.apache.org/index.html. 2014-09-08.

Table 1. Frequency list for cervical cancer records assigned ICD-10 codes C53.0, C53.1, C53.8 and C53.9

Term frequency   Term         Engl. translation
4252             smärta       pain
3338             illamående   nausea
1895             blödning     bleeding
1735             opåverkad    unaffected
1714             besvär       trouble
1650             mår bra      feel well
1528             ont          ill
1498             smärtor      pains
1495             feber        fever
1472             trött        tired

4 Results

Table 1 depicts the 10 most frequent symptoms in patient records that contain one of the four ICD-10 codes C53.0, C53.1, C53.8 and C53.9. The entire frequency list comprises 28,218 symptoms.

We applied the Clinical Entity Finder, CEF, which was trained on annotated data from one domain, to a different domain. To provide an estimate of how well CEF works within the new domain, one member of the group conducted a qualitative analysis of two pre-annotated patient records, manually reading through them and checking whether the symptoms were annotated correctly.
It was found that CEF, which is trained on data from the internal medicine emergency domain, failed to detect some cervical cancer related terms. While cervix, cervix cancer, as well as the abbreviations cervixca. and skivepitelca. were missed, cancer and cervixcancer were correctly detected. In the two files, 9 and 14 symptoms (findings + disorders), respectively, were negated, indicating the absence of those particular symptoms. To sum up, CEF yielded promising results, missing only 3 to 6 percent of the symptoms per record.

We identified several restrictions and drawbacks that have to be handled in order to obtain a representative frequency list of cervical cancer symptoms.

– Multiple inflectional forms of the same word, such as smärta (Engl.: pain) and smärtor (Engl.: pains), occur in the frequency list. Using lemmatization, they should be reduced to their base form in order to only include the main symptom concept in the frequency list.
– So far the frequency list contains symptoms which are negated and which should be removed from the list. Negation detection will need to be applied in order to filter out these symptoms.
– Since we are interested in early symptoms, mainly daily patient notes that are added to the EHR before the cancer diagnosis are of interest. So far we have used all patient notes that exist in the EHR for a patient with a cervical cancer diagnosis. A future task is to use only those notes made before diagnosis when detecting symptoms and generating frequency lists from them.
– Identifying symptoms by applying CEF yielded promising results. Yet CEF should be adapted to the domain by using more domain-relevant training data and incorporating negation.

5 Discussion

During our research work we encountered some challenges.
We are not yet at the stage of identifying any unknown early symptoms of cervical cancer, but we are able to successfully confirm known symptoms such as bleeding, which is a possible symptom of cervical cancer. Some of the symptoms we identified were actually negated symptoms, such as not bleeding, which our system could not identify as negated findings/symptoms since we did not use any negation detection system. Some of the symptoms enumerated in Table 1 are therefore negated.

Findings that occur in singular or plural form, such as bleeding or bleedings, could be reduced to one base form using a lemmatizer. The same approach can be applied to the definite and indefinite forms of nouns. Definite nouns in Swedish use the suffix en to mark the definite form, blödning+en => blödningen, instead of a separate article as in English, the bleeding. Merging these equivalent findings would make the analysis easier and increase precision.

Another obstacle was temporality: the patient record stretches over several months or years, and we need a method to identify when something occurred. We do have time stamps on each note, but within a note the physician sometimes refers to earlier findings and relates to them.

Regarding the identification of terms, we saw that there are many non-standard words, abbreviations, and compounds of abbreviations and words, for example cervixca., that CEF could not identify as named entities. This could most easily be solved by adding in-domain annotated data.

6 Conclusions and Further Work

This paper described the first steps towards finding unknown early symptoms and building a symptom spectrum for cervical cancer. As the project progresses we plan to work on the following tasks:

– Defining and extracting the control group
– Testing and advancing the following methods to identify symptoms captured in the patient records:
  • NER and frequency counting approach
  • Named Entity Recognition and Random Indexing
  • Clustering
– Using and adapting existing text mining tools for the domain and incorporating them into the preceding methods:
  • Lemmatization
  • Negation and certainty detection
  • Temporality
  • Mapping symptoms to ICD-10 codes
– Analyzing and assembling the results as well as designing a visual representation for the developed symptom spectrum.

One limitation may be that aim 1, finding early and/or previously unknown symptoms of cervical cancer, cannot be fulfilled. However, this in turn could actually inform health care practice and confirm the current evidence base for cervical cancer as a relatively symptom-less disease, demonstrated by systematically exploiting a novel data source: medical records. Regardless of aim 1, our aim 2 should provide valuable information on the symptom spectrum in cervical cancer.

This paper has outlined the current state of the art within cervical cancer prevention and how text mining has hitherto been applied in the cancer domain. Further, this paper presented (1) initial experiments that have been performed and (2) an outlook on proposed work to find unknown early symptoms and develop a symptom spectrum for cervical cancer.

We believe that outlining the scope of the project, including aims, state-of-the-art research, proposed future work and limitations, as well as performing initial experiments, was crucial for enabling a stringent workflow in the project.

Our methodology can also be seen as a part of the HEALTH BANK workbench proposed in [5], which will offer processed aggregated and unaggregated clinical data for research to be used in a wider context.
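
The NER and frequency counting approach, combined with the planned lemmatization and negation filtering and the comparison of a cervical cancer list with a control list, could be prototyped roughly as follows. This is a minimal Python sketch under stated assumptions rather than the project's implementation: the (term, is_negated) input format, the hand-made LEMMAS table and all function names are hypothetical, and the toy mentions are invented for illustration only.

```python
from collections import Counter

# Hypothetical lemma table mapping inflected Swedish forms to a base form.
# A real system would use a Swedish lemmatizer instead of a hand-made dict.
LEMMAS = {
    "smärtor": "smärta",
    "blödningen": "blödning",
}


def normalize(term):
    """Lower-case a term and map it to its base form when one is known."""
    term = term.strip().lower()
    return LEMMAS.get(term, term)


def build_frequency_list(mentions):
    """Count affirmed symptom mentions.

    `mentions` is an iterable of (term, is_negated) pairs. This input
    format is an assumption, not the actual output format of CEF.
    """
    counts = Counter()
    for term, is_negated in mentions:
        if is_negated:
            continue  # a negated mention signals absence of the symptom
        counts[normalize(term)] += 1
    return counts


def relative_frequencies(counts):
    """Turn absolute counts into shares of all counted mentions."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


def compare(case_counts, control_counts):
    """Return (term, case_share, control_share) rows, sorted so that terms
    over-represented among cases come first."""
    case_rel = relative_frequencies(case_counts)
    control_rel = relative_frequencies(control_counts)
    terms = set(case_rel) | set(control_rel)
    rows = [(t, case_rel.get(t, 0.0), control_rel.get(t, 0.0)) for t in terms]
    return sorted(rows, key=lambda row: row[1] - row[2], reverse=True)


if __name__ == "__main__":
    # Toy mentions, invented for illustration only.
    case_mentions = [
        ("smärta", False), ("smärtor", False), ("blödning", False),
        ("blödning", True), ("feber", False),
    ]
    control_mentions = [("feber", False), ("feber", False), ("smärta", False)]

    case = build_frequency_list(case_mentions)
    control = build_frequency_list(control_mentions)

    for term, count in case.most_common():
        print(count, term)
    for term, case_share, control_share in compare(case, control):
        print(f"{term}: cases={case_share:.2f} controls={control_share:.2f}")
```

Comparing shares rather than raw counts is one simple way to keep the two lists comparable when the study and control groups differ in size; a statistical test would be needed before drawing any conclusion from such differences.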
Acknowledgements

The authors would like to thank the Nordic Information for Action eScience Center of Excellence in Health-Related e-Sciences (NIASC) and Nordforsk for funding the project, and Eric Thuning and Per "Pelle" Olofsson, both IT experts at DSV, for help with the management of the Stockholm EPR Corpus server. We would also like to thank Maria Skeppstedt for letting us use the Clinical Entity Finder and Aron Henriksson for assisting us in executing the Clinical Entity Finder.

References

1. American Cancer Society: Cervical Cancer Prevention and Early Detection (2014), http://www.cancer.org/acs/groups/cid/documents/webcontent/003094-pdf.pdf
2. Cancer Research UK: Worldwide cancer incidence statistics, http://www.cancerresearchuk.org/cancer-info/cancerstats/world/incidence/Common, visited: November 13th 2014
3. Dalianis, H.: Clinical text retrieval - an overview of basic building blocks and applications. In: Paltoglou, G., Loizides, F., Hansen, P. (eds.) Professional Search in the Modern World, vol. 8830, pp. 147–165. Springer Verlag, Lecture Notes in Computer Science (2014)
4. Dalianis, H., Hassel, M., Henriksson, A., Skeppstedt, M.: Stockholm EPR Corpus: A clinical database used to improve health care. In: Swedish Language Technology Conference. pp. 17–18 (2012)
5. Dalianis, H., Henriksson, A., Kvist, M., Velupillai, S., Weegar, R.: HEALTH BANK – A Workbench for Data Science Applications in Healthcare. In: Proceedings of CAiSE'15 – Industry Track. Springer Verlag, Lecture Notes in Computer Science (2015)
6. Swedish Council on Health Technology Assessment (SBU): Tidig upptäckt av symtomgivande cancer - En systematisk litteraturöversikt (in Swedish) (January 2014), http://www.sbu.se/upload/Publikationer/Content0/1/Tidig_upptackt_symtomgivande_cancer_smf.pdf
7.
Lim, A.W., Ramirez, A.J., Hamilton, W., Sasieni, P., Patnick, J., Forbes, L.J.: Delays in diagnosis of young females with symptomatic cervical cancer in England: an interview-based study. British Journal of General Practice pp. e602–e610 (October 2014)
8. Lim, A.W., Forbes, L.J., Rosenthal, A.N., Raju, K.S., Ramirez, A.J.: Measuring the nature and duration of symptoms of cervical cancer in young women: developing an interview-based approach. BMC Women's Health 13(1), 45 (2013)
9. Meystre, S., Savova, G., Kipper-Schuler, K., Hurdle, J.: Extracting information from textual documents in the electronic health record: a review of recent research. IMIA Yearbook of Medical Informatics 47, 128–144 (2008)
10. Park, S.B., Hwang, S., Zhang, B.T.: Mining the risk types of human papillomavirus (HPV) by AdaCost. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds.) Database and Expert Systems Applications. Springer (2003)
11. Polpinij, J., Miller, A.: Ontology-based text analysis approach to retrieve oncology documents from PubMed relevant to cervical cancer in clinical trials. In: ICDM Workshop on Advances in Data Mining. IBaI Publishing, Leipzig (2010)
12. Skeppstedt, M., Kvist, M., Nilsson, G.H., Dalianis, H.: Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study. Journal of Biomedical Informatics 49, 148–158 (June 2014)
13. Spasić, I., Livsey, J., Keane, J.A., Nenadić, G.: Text mining of cancer-related information: Review of current status and future directions. International Journal of Medical Informatics 83(9), 605–623 (2014)
14. Storck, S.: Cervical dysplasia. Online (2014), http://www.nlm.nih.gov/medlineplus/ency/article/001491.htm, MedlinePlus
15. Sundström, K.: Human Papillomavirus Test and Vaccination - Impact on Cervical Cancer Screening and Prevention. Ph.D.
thesis, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden (2013)
16. Walboomers, J.M.M., Jacobs, M.V., Manos, M.M., Bosch, F.X., Kummer, J.A., Shah, K.V., Snijders, P.J.F., Peto, J., Meijer, C.J.L.M., Muñoz, N.: Human papillomavirus is a necessary cause of invasive cervical cancer worldwide. The Journal of Pathology 189(1), 12–19 (1999)