The Handling of Missing Values in Medical Domains with Respect to Pattern Mining Algorithms

Danilo Schmidt², Matthias Niemann¹, and Gabriela Lindemann-von Trzebiatowski³

¹ Department of Transfusion Medicine, University Hospital Charite
matthias.niemann@charite.de
² Department of Nephrology, University Hospital Charite
danilo.schmidt@charite.de
³ Department of Governing Bodies, Humboldt University of Berlin
gabriela.lindemann@uv.hu-berlin.de

Abstract. Missing values are a widespread problem in the analysis of large data sets. In the medical domain they are unavoidable, and analysis methods that require complete data fail here. In this paper we give an overview of the kinds of missingness and of common methods for handling missing values in machine learning algorithms. We introduce the Charité Query Language Toolkit, which was developed to find similar patterns in the data records of post-kidney-transplant patients. The toolkit uses available case analysis combined with a preprocessing of missing values as a compromise between simplicity and functionality.

Key words: data mining, medical data, missing values

1 Introduction and Background

Missing values are a widespread problem for analysis methods such as machine learning, pattern recognition or data mining algorithms in many domains. For medical data sets, missing values are unfortunately unavoidable. In a complete case analysis of such data sets, all patient records with missing data would be excluded. Performing clinical studies only with complete patient records leads to a significantly smaller sample size with reduced statistical power. Depending on the method chosen for the statistical analysis, missing values can restrict the cohort so much that the whole study is endangered.

In the last decades the amount of electronically collected patient data has grown rapidly, and researchers and physicians clearly need analysis methods and tools for data sets with missing values.
In our work, we describe the different kinds of missing values, following in principle the classification of Pigott [2] and de Goeij et al. [1]. We show how to deal with missing values using pattern mining algorithms as an example and introduce some useful preprocessing methods for medical data. Finally, we present a short example of how these methods are included in our frequent pattern mining toolkit.

2 Kinds of Missingness

Missing values are a common issue when analyzing data in a wide range of research fields. In the medical domain they seem unavoidable, especially in long-term treatments. De Goeij et al. [1] define a missing value as "hiding the value of an attribute". While analyzing a dataset, a missing value occurs when the specific value is not available. This does not necessarily mean that the value does not exist, only that it is unknown. The missing value may be one of the attribute values (e.g. a categorical value) or a unique value (e.g. a numerical value). To quantify missingness, the ratio of missing values to all values can be formed over all attributes by a simple formula, where B denotes the set of all values and B_missing the subset of missing values:

\[
\mathrm{missingness}(B) = \frac{|B_{\mathrm{missing}}|}{|B|} \tag{1}
\]

There are several reasons for the missingness of values in medical data sets. Depending on the ratio missingness(B) and the analysis method applied, missing values may distort the final result, and the underlying missing data mechanism may bias the statistical analysis. It is therefore appropriate to consider the kind of missingness present in a particular data set before choosing an adequate analysis algorithm. With respect to the reasons for missingness, three categories are distinguished: MCAR, MAR and NMAR.

MCAR (missing completely at random) is the strongest assumption about the missing values of a dataset. The absence of a value depends neither on the observed parameters nor on the unknown value itself.
MCAR will not bias the analysis of data, because the missing data has the same distribution as the available data. In the medical domain it occurs, e.g., if someone forgot to order an examination or there were problems in the transmission of laboratory data.

A less strict mechanism is MAR (missing at random). The absence of a value is allowed to depend on the observed parameters but not on the missing value itself. In the long-term treatment of a patient it happens, e.g., if the patient was not motivated to come to a check-up because he had no health problems. When choosing the set of variables, MAR can be avoided by selecting highly correlated variables to be observed. Clearly, this requires a specialist with domain-specific knowledge.

The third category of missingness comprises data whose values are not missing at random (NMAR). Here the absence of a value depends on the value itself. For example, a creatinine value of a kidney transplant patient was not measured because of rejection and loss of the transplant.

Without evaluating the dataset, no type of missingness can be ruled out. Even worse, several types of missingness can occur within a single dataset, with different mechanisms overlapping, as the examples above show. Furthermore, it is not trivial to find out which kind of missingness applies.

3 Preprocessing Missing Values for Application in Pattern Mining Algorithms

There are several ways to handle missing values in pattern mining and data analysis algorithms. Especially in the medical domain there are several studies on different approaches to dealing with missing values. Van der Heijden et al. [3] and Marlin [4] developed methods for handling missing values for several machine learning techniques. The most common methods are:

1. Complete Case Analysis (only for remaining complete rows),
2. Available Case Analysis (complete rows for current pattern),
3. Single Unconditional Mean Imputation (impute column's mean),
4. Single Conditional Mean Imputation (impute mean based on conditional columns),
5. Multiple Imputation (regression model generates complete sets),
6. Maximum Likelihood (estimate underlying distributions),
7. Pattern-Mixture Model (user defined patterns of missingness).

All methods are suitable for MCAR data sets. Single Conditional Mean Imputation, Multiple Imputation, Maximum Likelihood and Pattern-Mixture Model are additionally usable for MAR data sets, but only the last one is practicable for NMAR data sets. In the following we briefly introduce how each of these methods handles missing values.

3.1 Complete Case Analysis

The easiest way to handle missing values is to delete all cases (rows) that contain missing values. The remaining data will be complete, and all methods requiring complete data sets can be applied without further issues. When applying Complete Case analysis (CC) to Table 1, patients 2, 3 and 4 would be discarded. In subsequent operations, only patient 1 would be considered.

If the missing values are not missing completely at random, Complete Case analysis will be biased. Unfortunately, MCAR rarely applies (see Table 2). Furthermore, deleting many cases is not feasible if there are missing values in almost every case. The loss of information would increase while the significance decreases [3].

Table 1. Data set with missing values

Patient   Test A   Test B     Test C
1         1.0      positive   positive
2         3.0      negative   ?
3         ?        negative   negative
4         ?        ?          negative

3.2 Available Case Analysis

Available Case analysis (AC), which is sometimes referred to as pairwise deletion, is less strict than Complete Case analysis. When analyzing a subset of the observed variables, all cases that are complete for that subset are considered. That means only missing values are ignored [3]. Considering tests B and C of Table 1, patients 2 and 4 are discarded because for the given set of tests only patients 1 and 3 are complete.
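As an illustration, the missingness ratio of Equation (1) and the two deletion strategies can be sketched in Python on the data of Table 1. This is a minimal sketch and not the toolkit's implementation; the dictionary layout, the use of None for "?", and the function names are assumptions made for this example.

```python
# Table 1 from the text; None stands for a missing value ("?").
# The data layout and all function names are hypothetical.
TABLE_1 = {
    1: {"A": 1.0, "B": "positive", "C": "positive"},
    2: {"A": 3.0, "B": "negative", "C": None},
    3: {"A": None, "B": "negative", "C": "negative"},
    4: {"A": None, "B": None, "C": "negative"},
}


def missingness(table):
    """Ratio of missing values to all values, as in Equation (1)."""
    values = [value for row in table.values() for value in row.values()]
    return sum(value is None for value in values) / len(values)


def available_cases(table, columns):
    """Patients whose values are present for every column in `columns`."""
    return [
        patient
        for patient, row in table.items()
        if all(row[column] is not None for column in columns)
    ]


def complete_cases(table):
    """Patients without any missing value (AC applied to all columns)."""
    all_columns = {column for row in table.values() for column in row}
    return available_cases(table, all_columns)


print(missingness(TABLE_1))                  # 4 of 12 values are missing
print(complete_cases(TABLE_1))               # [1]
print(available_cases(TABLE_1, ["B", "C"]))  # [1, 3]
```

Complete Case analysis is written here as the special case of Available Case analysis in which the subset consists of all columns, which makes the difference between the two methods explicit.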
Available Case analysis is rather easy to apply but has some drawbacks, as mentioned in [3]. Because of the varying number of observations, errors in estimated covariance matrices might occur. Furthermore, the estimates are consistent only if the missing values are MCAR. It has been shown that Available Case analysis is superior to Complete Case analysis for weakly correlated variables. For strong correlations, AC is inferior to CC.

Table 2. Because A2 is only performed if A1 is positive, the missingness of A2 is MAR

Patient   A1         A2
1         negative   ?
2         negative   ?
3         positive   negative
4         negative   ?
5         positive   positive
6         positive   positive

3.3 Single Unconditional Mean Imputation

A contrary approach to CC and AC is imputation. Generally speaking, missing values are filled in (imputed) with calculated values. After that procedure the data can be handled like a complete set. There are different methods of imputing values, which are introduced briefly in the following.

Single Unconditional Mean Imputation (sometimes referred to as single value imputation) replaces all missing values of an observed variable by the mean of the available values of that variable [3]. In Table 1, the mean of test A is 2.0. Hence the missing values of patients 3 and 4 would be replaced with 2.0.

This method has several drawbacks: the variance of the imputed variables decreases while the precision is overrated. The results of an unconditional mean imputation will always be biased. On the other hand, the method is rather easy to apply, as no further information about the dataset is required. When adapting this approach to categorical data, the most frequent value of a column is imputed. To avoid overrating the precision, unconditional imputation may be extended to analyze the columns in order to find their underlying random distribution. The imputation is then based on each column's distribution.

3.4 Single Conditional Mean Imputation

An improvement on unconditional imputation is the conditional imputation method.
The missing values are imputed by linear regression on the conditional (observed) variables with complete data. When considering test B as the condition for test A in Table 1, the mean of column A is calculated for all patients where test B is negative (mean = 3.0) and once more for all patients where test B is positive (mean = 1.0). Hence the imputed value for test A of patient 3 would be 3.0. Patient 4 cannot be imputed this way, because test B is missing as well.

The selection of conditional variables is not trivial. When too many columns are selected, the imputed value may be overfitted.

3.5 Multiple Imputation

In contrast to the previously mentioned single imputation methods, Multiple Imputation (MI) does not calculate a single mean of an observed variable in order to impute a missing value, but creates a set of possible complete data sets. Each imputed value is drawn from the column's underlying random distribution, which is determined by regression. The analysis is carried out on each complete set. Finally, the results are combined.

Practical tests show that MI often performs better than CC and AC, especially in the field of nephrology. This holds for MCAR and MAR data as long as the model specification is suitable. Overrated precision is avoided by imputing the data several times, while bias is avoided by applying regression.

3.6 Maximum Likelihood

Maximum Likelihood (ML) methods estimate the parameters of the underlying distributions of the observed variables. To obtain the most probable parameters, an EM algorithm can be used. If the algorithm converges, the coefficients with the highest likelihood can be used in linear regression models [3]. In contrast to imputation methods, no estimated values are filled into the gaps of the data set. Instead, ML methods can help to provide significant estimates for regression models.

ML is unbiased for data that is MCAR or MAR and outperforms CC, AC and single imputation methods. However, a proper statistical model is essential.
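The single imputation methods of Sections 3.3 and 3.4 can be illustrated on the data of Table 1. The following Python sketch is a simplification under stated assumptions: the conditional mean is computed as a plain group mean per value of one conditioning column, as in the worked example of Section 3.4, rather than by a fitted regression model, and the data layout and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Table 1 from the text; None stands for a missing value ("?").
TABLE_1 = {
    1: {"A": 1.0, "B": "positive", "C": "positive"},
    2: {"A": 3.0, "B": "negative", "C": None},
    3: {"A": None, "B": "negative", "C": "negative"},
    4: {"A": None, "B": None, "C": "negative"},
}


def impute_unconditional_mean(table, column):
    """Replace missing numeric values by the mean of the observed values."""
    observed = [row[column] for row in table.values() if row[column] is not None]
    fill = mean(observed)
    return {
        patient: row[column] if row[column] is not None else fill
        for patient, row in table.items()
    }


def impute_most_frequent(table, column):
    """Categorical variant: replace missing values by the most frequent value."""
    observed = [row[column] for row in table.values() if row[column] is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return {
        patient: row[column] if row[column] is not None else fill
        for patient, row in table.items()
    }


def impute_conditional_mean(table, target, condition):
    """Replace missing `target` values by the mean of `target` among the
    patients that share the same `condition` value. A value stays None
    if the condition is missing too or has no observed target values."""
    groups = {}
    for row in table.values():
        if row[target] is not None and row[condition] is not None:
            groups.setdefault(row[condition], []).append(row[target])
    group_means = {key: mean(values) for key, values in groups.items()}
    return {
        patient: row[target]
        if row[target] is not None
        else group_means.get(row[condition])
        for patient, row in table.items()
    }


print(impute_unconditional_mean(TABLE_1, "A"))     # patients 3 and 4 get 2.0
print(impute_most_frequent(TABLE_1, "B"))          # patient 4 gets "negative"
print(impute_conditional_mean(TABLE_1, "A", "B"))  # patient 3 gets 3.0
```

In the conditional variant, patient 4 keeps a missing value for test A because the conditioning test B is missing as well, which shows one practical limit of conditioning on a single column.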
3.7 Pattern-Mixture Model

For pattern-mixture models the missing data mechanism may remain unknown. Instead, a mixture of different patterns describes the missingness in the data set, where each pattern describes a subset of the missing values. These patterns support the statistical model and can therefore improve the analysis. Hence pattern-mixture models can produce good estimates for data that is MCAR, MAR or NMAR. Unfortunately, creating patterns requires a lot of domain-specific knowledge about the data.

4 The Charité Query Language Toolkit

The Charité Query Language Toolkit was developed to find similar patterns in the data records of post-kidney-transplant patients. It should enable physicians to find similar courses of disease and treatment and to draw conclusions from them for current cases.

For the development of a toolkit, it might be disappointing that there is no simple general-purpose method that handles all missing values in every imaginable query, especially if the missing data mechanism is unknown. Calders et al. [5] summarize the common methods and propose using different approaches in order to estimate the robustness. Applying multiple methods for handling missing values might confuse future users of the software, so a compromise has to be found.

The toolkit focuses on ease of use. The user is not expected to provide additional information about the data. For that reason the user cannot be asked to select variables for imputing the data. Model-based approaches provide better estimates at the cost of higher complexity and therefore have to be avoided too. Since there are different data sources, many missing values can be expected. In complex queries, complete case analysis can lead to all transactions being dropped. Even in simple queries the missingness may be very high. That is why complete case analysis has to be avoided as well.
The toolkit uses available case analysis combined with a preprocessing of missing values as a compromise between simplicity and functionality. It does not aim at statistically faultless results. Biased correlations caused by violations of the MCAR property can be expected and are accepted for that purpose. When searching for new correlations, the user may not be interested in strong and therefore possibly known rules but in weaker or overlooked associations.

In the design phase of the toolkit, two essential settings have to be made. First, time slices must be defined in order to discretize the time axis (see Figure 1). Second, norms must be defined in order to discretize the parameters. Norm values of parameters can be set in the norms tab (see Figure 2). A group of norms for the same parameter is called a norm family. This classification is necessary in order to recognize missing values.

The norm type depends on the parameter in question, as some tests produce qualitative values (e.g. HCV, CMV, etc.) and some produce quantitative values (e.g. heart rate, creatinine, etc.). Qualitative norms are simply a mapping of a string value to the norm's name. This makes it possible to assign several values to the same category (e.g. weak and positive are both mapped to not negative). Quantitative norms are ranges for numeric parameters that are mapped to the name of a norm (e.g. a creatinine value of 1.2 to 6.0 is mapped to bad).

Norms can be created automatically as well. The database contains references for several parameters that may be loaded. Furthermore, a function is available that calculates quartiles and creates norms from them.

5 Discussion and Conclusion

In this paper we give a survey of the kinds of missing values and of common methods for handling them in pattern mining algorithms. We introduce the Charité Query Language Toolkit, which was configured to work on post-kidney-transplant patient data.
Since the data does not differ from that of other medical domains, the toolkit may be used in other departments as well. Either a separate database is provided or the data is loaded into the current database. Depending on the domain, individual preprocessing plug-ins might be necessary in order to provide proper data transformation abilities.

Fig. 1. Time slices

Fig. 2. Norms

References

1. de Goeij MC, van Diepen M, Jager KJ, et al. Multiple imputation: dealing with missing data. Nephrol Dial Transplant. 2013 Oct;28(10):2415-2420.
2. Pigott TD. A Review of Methods for Missing Data. Educational Research and Evaluation. 2001;7(4).
3. van der Heijden GJMG, Donders ART, Stijnen T, et al. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. Journal of Clinical Epidemiology. 2006;59:1102-1109.
4. Marlin BM. Missing Data Problems in Machine Learning. Canadian theses. Library and Archives Canada - Bibliotheque et Archives Canada; 2008. http://books.google.de/books?id=5FlBPwAACAAJ (accessed 25 June 2015).
5. Calders T, Goethals B, Mampaey M. Mining Itemsets in the Presence of Missing Values. In: Proceedings of the 2007 ACM Symposium on Applied Computing. SAC '07. New York, USA: ACM; 2007:404-408. http://doi.acm.org/10.1145/1244002.1244097 (accessed 25 June 2015).
6. Niemann M, Schmidt D, Lindemann von Trzebiatowski G, et al. First Steps towards a Frequent Pattern Mining with Nephrology Data in the Medical Domain. In: CSP, editor. Proceedings of the 21th International Workshop on Concurrency, Specification and Programming. vol. 2;2012:261-268.