Big data techniques to discover kidney problems at early stages: a prospective study Omar García-González1, Ivan E. Villalon-Turrubiates1 and Pilar Pozos-Parra2 1Instituto Tecnológico y de Estudios Superiores de Occidente (ITESO) Periférico Sur Manuel Gómez Morín 8585, 45604 Tlaquepaque, Jalisco, México 2Universidad Autónoma de Baja California Calzada Universidad 14418, 22424 Tijuana, B.C, México ng724433@iteso.mx, villalon@iteso.mx, maria.pozos@uabc.edu.mx Abstract. Chronic Kidney Disease is a decrease in the kidney function, which can eventually derive in the cessation of the total function. It affects around 10 to 15% of the adults globally. This number is expected to grow as the diabetes disease is grow- ing, and kidney disease is one of the consequences of diabetes. Some computational tools and techniques related to big data, machine learning, clustering, signal and im- age processing, and data mining among others, promises huge benefits for medical research through technologies that can provide a better categorization of the infor- mation, which will derive on an easier way to analyze data and convert it into valuable information for decision-making. This paper presents an analysis of the state of the art, and previous advances of the use of technology for discovering the presence of kidney diseases. This will lead to alternative and novel ways to detect the disease on its initial stages, with the aim of supporting the medical decision-making process. Keywords: engineering in medicine, big data applications, machine learning 1 Introduction As referred by Mayer [1], the purpose of the analysis of data is no longer simply answering existing questions but generating new hypotheses. Using the power of well-organized abun- dant information, guides us to the creation of knowledge in a faster ratio than we ever im- agined. Some none invasive techniques are being used to predict kidney malfunction. There are three key areas, where Big Data differentiates from any existing conventional analyses of a given data sample: 1. Data is being captured in a more comprehensive way. 2. Inclusion of new techniques like machine learning, which according to Murphy [2] is the set of methods that can automatically help us by detecting patterns in data, which we can use to predict future data, or to perform decision making under uncertainty. 3. Creation of new hypotheses. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) 150 This takes us to the conclusion that if we take advantage of the benefits of Big Data, we can accelerate the research in any field; medicine is not an exception. 2 The big data paradigm 2.1 Historical Review The growth rate in the volume of data, popularly known as the “information explosion” [3] was first used according to the Oxford English Dictionary in 1941. Since then, multiple major milestones in the history of the big volumes of data have been reached: 1. On 1944, Fremont Rider [4] (Wesleyan University Librarian) estimated that the Ameri- can university libraries, were doubling the size of their amount of information every six- teen years. 2. On April 1980, Tjomsland [5] said that the large amounts of data are being retained, because the users involved have no way of identifying obsolete data, besides the fact that the consequences associated for storing obsolete data, are less than the ones associated to discard potentially useful data. 3. On October 1998, Coffman [6] concluded that the growth rate of traffic on the public Internet was about 100% per year; data traffic overtook voice traffic on the U.S. by 2002. 4. On March 2007, Gantz [7] estimated that the information added year by year to the digital universe, will increase more than six-fold to 988 Exabyte’s, doubling every 18 months. (On a follow up release of the same study, that forecast was surpassed reaching 1227 Exabyte’s in 2010, and getting to 2837 Exabyte’s in 2012). 5. On February 2011, Hilbert [8] estimated that in the year of 1986, 99.2% of the storage capacity was analog, while in 2007 94% was digital (it was on 2002, that digital infor- mation surpassed the non-digital for the first time). 2.2 Big Data Definition Despite the fact that there is not a simple and widely accepted definition of the term Big Data, multiple authors and data engineer experts uses three V’s to approach a definition (volume, velocity and variety). Volume refers to the amount of data, variety to the different types and sources of data, and velocity to the speed needed to process it. The increase of the volume of data, as well as the variety of data sources and formats is exponentially growing year by year, while the users are demanding an increase in the velocity to access and/or process the data. 151 3 We define the big data management as the set of techniques and tools necessary to deal with the increasing amount of information, which has the purpose of organizing and cate- gorizing the information in a better way, so that we can extract valuable information from huge data-bases, at the maximum possible speed. Besides the challenges that the volume, variety and velocity causes, big data is becoming extremely valuable in the development of new knowledge for multiple business and indus- tries. 3 Engineering for chronical kidney disease analysis Multiple studies use technology to generate faster advances on a research. In the particular case of Chronical Kidney Disease (CKD), there are several examples where the analysis of data, has raised the interest of both medical and Big Data researches. 3.1 Medical Studies A recent study analyses the early stage of kidney complication caused by Diabetes Mellitus (DM) through the analysis of the iris image of the patient [9]. Iridology is a technique based on the shape and structure inside the iris, which can picture the body system. According to this technique, anything that happens in the body is reflected as a sign in the eye-iris. In his analysis, Prayitno [9] used 47 participants, from which 31 were preliminary diag- nosed with DM, while 16 had no prior indication of DM or any kidney damage. They all had their eye images captured and analyzed, and also a blood test was taken from them. From the 47 patients, 36 of them (76%) were showing a broken tissue, related to the kidney location on the iris. From the 31 participants that were preliminary diagnosed with DM, a 100% were reflecting broken tissue. This concludes that all patients with DM, showed any type of complication with their kidney, and this was reflected in the image of their iris. According to a recent study [10] there are strong links between depression and anxiety and CKD. Despite the fact that anxiety and depression in advanced stages sound logically related, studies have demonstrated higher prevalence rates of depression in patients with CKD than other chronic diseases. Depression is an emotional state, which is characterized by causing somatic and cogni- tive symptoms like a constant feeling of sadness, sleeplessness, and in many occasions the loss of appetite and sexual desire, and the lack of interest in common activities [10]. Anxiety is an emotional state, which causes a person to feel intense fear, uncertainty, and dread from anticipation of any given threating situation. When anxiety becomes a disorder (in difference to brief anxiety states), remains at least 6 months, and can get worse if the patient is not under treatment [10]. According to Hedayati [11] and Cukor [12], patients with CKD have a ratio of depression five times more compared with general population. The range is between 20% and 30% of 152 patients with CKD, affected with depression. To prove this numbers, this was illustrated in the analysis of 249 studies conducted by Palmer et al. [13]. The patients treated with dialysis the rate of depression was 22.8% (this by using clinical interviews). But, when the technique used for the measurement was self-rated questionnaires, the occurrences increased to 39.3%. According to a study made on 2007 [14], the prevalence rate of anxiety is estimated to be between 12% and 52%; however, there is a limited number of studies, therefore the exact rate is uncertain. As we can see, the exact rates of anxiety and depression is not well defined, but inde- pendently if whether the reality resides in the upper or lower boundaries of the rates, the levels are alarming, and there is a high correlation between CKD and these emotional states. 3.2 Algorithm Studies On a different study, multiple factors for kidney dialysis such as creatinine, sodium and urea play an important role in deciding the survival prediction of the patients [15]. Cluster- ing the information is important to identify the influence of kidney dialysis parameters. Using a simple K-means algorithm can help to determine the interaction between these pa- rameters and patient survival. Table 1 shows the range of parameters that are used for prioritization (if the patient falls into the high priority level, a kidney transplantation is needed), categorized in high, medium and low, based on the clinician feedback that was gathered from the analysis of 230 datasets taken from the Global Hospital Chennai [15]. Data mining is the process of extracting hidden information from a huge dataset. When choosing the appropriate data mining algorithms, in combination with applying a correct procedure on dialysis data set, will derive in a survival prediction of patients with CKD. On this analysis, the author used K-means. The study did not consider other parameters like chloride and bicarbonate levels, those will be analyzed in a future research. On a different research, it was learned that using a support vector machine can help a doctor to detect if a patient is showing a chronic condition or not, with an accuracy of 98.35% [16]. This technique is divided in two phases: the classification modeling (which is in charge of finding rules and model in the classification of kidney disease) and the system development (which uses the input data and applies machine-learning techniques to give a result to the doctors). The use of a selector named OneR attribute [17] which helps to extract action rules de- pending on the stages of the CKD condition, helps to prevent the advance of a chronical renal disease to beyond stages. On table 2, we can see the 5 stages and the description of each, depending on the level of GFR (glomerular filtration rate), which is the best way to measure the level of kidney function, to determine the stage of a kidney disease. A Naïve classifier uses the probability based on the Bayes theorem, with strong independence as- sumptions between the features. The results after applying this method, is the reduction of 80% of the attributes in the dataset, while improving the accuracy by 12.5%. 153 5 Table 1. Range of Parameters for prioritization decision making. Table 2. Five stages of CKD, based on glomerular filtration rate The system extracted the action rules, for any given chronical disease stage, so that the specific treatment can be taken. To avoid the CKD to advance to the next stage. Over the last decades, the use of multiple data mining techniques to investigate diseases have become essential for the health care industry, and therefore its use have exponentially increased. As we can see in Fig. 1, classification, which is the approach that assigns objects into groups that share common characteristics [18], has been the preferred method during the last 15 years to investigate breast cancer, heart and kidney disease. 154 After analyzing the behavior of two different classification techniques: Artificial Neural Network (ANN) and Naïve Bayes to predict and diagnose CKD [18], the conclusion is that Bayes is the most accurate classifier with 100% of accuracy, against a 72.73% of accuracy when using ANN. 3.3 The importance of Machine Learning Besides the fact that machine learning and big data are closely related to each other, it is important to understand about machine learning, because the algorithms commonly used by it, determine how we are going to interpret and process big data. Machine learning can be defined as the application of artificial intelligence, with the objective of providing a system the ability to automatically learn and improve from experi- ence, without the need of being explicitly programmed [19]. Almost 70 years ago, Alan Turing on his Computing Machinery and Intelligence paper [20] asked the question: can machines think? On his paper, Turing refuted multiple objections opposed to his opinion. As highlights, two of them can be addressed: 6. Theological objection: It argues that thinking is a unique function of man given by god through the soul. Hence animals and machines cannot think. Turing refuted this argu- ment stating that machines would not usurp God’s power more than humans do so when procreating children. Fig. 1. Data mining techniques used the last 15 years for disease detection 7. Mathematical objection: It argues that multiple logical mathematical results show that there are limitations to the power of discrete-state machines. Even do Turing acknowledged that there are limits to the power of machines, his said that it does not proof that the limits also apply to human intellect. 155 7 We cannot tell with 100% of certainty that a computer will be able to think, but no one has been able to prove it wrong. A couple of popular examples of machine learning which are currently used: 8. Fraud detection on credit cards: the machine learning code that is embedded to the banks systems is capable to learn and become familiar with your spending patterns, so that when something unusual is detected on your account, an alert is triggered. 9. Customer service and support: nowadays it is possible that you call a company asking for support, and you end up talking with a computer without even noticing it. This be- cause a computer can follow scripts to reply most of the questions that a regular customer might have, plus new technology that can perfectly simulate a real-life person’s voice. If we manage to properly use machine learning algorithms for analyzing and processing our data, we will have more and stronger tools to discover trends on the multiple variables that we want to consider for detecting CKD on its initial stages. 4 Future work Besides making regular invasive techniques to determine the presence of CKD on its mul- tiple stages, there are studies that look to correlate the kidney condition with other variables. We currently find efforts in iridology, emotional states (like anxiety and depression), data mining algorithms (like K-means, vector machines and Naïve Bayes). We will also deeply use other computational tools and techniques related to big data, machine learning, and signal and image processing. This research will focus in the discovery of the most adequate techniques and variables for analyzing and diagnose CKD, making special emphasis on the initial stages of the dis- ease. To achieve it, it is important to include the analysis of psychological variables and their impact on the development of the disease. We will use real data of current patients, from a public health system database. We will seek to find different ways of detecting CKD on initial stages, to develop alter- native techniques that would help to prevent a fast deterioration of the kidneys, and start treatment as early as possible to increase the life expectancy, together with a high quality of life on patients affected with the condition. It is important to highlight the fact that we will have support from urologists and other medical specialists during our investigation. 156 5 Conclusions The increase of the amount of information (especially in digital sources) developed the need to create more optimal ways to store, process and analyze data. Big Data promises to be a support for the scientific world (including medicine). Different methods are being used to show correlations of the deficiency of the kidney, with alternative methods for its detection. Studies suggests that the CKD is strongly related to other conditions like anxiety and depression. Apart from invasive classical methods, iridology and mathematical methods applied to given sets of data from patients with CKD, such as K-means, Naïve Bayes and support vector machine, helps with the classification and categorization of data for a better analysis, and therefore become a tool that support doctors on decision making when deciding the best treatment for patients with CKD. Machine learning techniques and algorithms will play an important role when deciding how to process and analyze the collected information. The use of alternative techniques for the detection of malfunctioning of the kidneys and the diagnosis of CKD on its initial stages, is not intended to replace the traditional methods, but to provide the doctors with additional and valuable information so they can take a more informed decision when deciding the best treatment for their patients. Acknowledgment The authors would like to thank the Instituto Tecnológico y de Estudios Superiores de Occidente (ITESO) of Mexico for the resources provided for this research, and to the Mex- ican National Council for Science and Technology (Consejo Nacional de Ciencia y Tecnología CONACYT) for its support thru the scholarship number 498569 assigned to the main author. References 1. V. Mayer-Schönberger and E. Inglesson, “Big data and medicine: a big deal?,” in NCBI Review Symp. Stanford, CA, USA, Jan. 2017, pp. 418-429. 2. K. Murphy, “Machine learning, a probabilistic perspective,” in The MIT Press, Massachusetts, US, Apr. 2014, pp. 62-63. 3. Press, G. (2013). A Very Short History Of Big Data. [online] Forbes. Available at: https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big- data/#6237ca465a18 [Accessed 12 Oct. 2018]. 4. Rider, A. (1944). The scholar and the future of the research library. A problem and its solution. [Advocating the use of micro-cards.]. New York City: Hadham Press. 157 9 5. Tjomsland, I. (1980). The gap between MSS Products and User Requirements. [online] IEEE Computer Society. Available at: http://www.gbv.de/dms/tib-ub-hannover/017462509.pdf [Ac- cessed 12 Oct. 2018]. 6. Coffman, K. and Odlyzko, A. (1998). The size and growth rate of the Internet. [online] Dtc.umn.edu. Available at: http://www.dtc.umn.edu/~odlyzko/doc/internet.size.pdf [Accessed 12 Oct. 2018]. 7. Gantz, J. and Reinsel, D. (2007). The Expanding Digital Universe: A Forecast Of Worldwide Information Growth Through 2010. [online] ECM Connection. Available at: https://www.ecmconnection.com/doc/the-expanding-digital-universe-mdash-a-foreca-0001 [Accessed 1 Sep. 2018]. 8. Hilbert, M. and Lopez, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute Information. [online] MartinHilbert.net. Available at: http://www.martinhil- bert.net/WorldInfoCapacity.html/ [Accessed 1 Sep. 2018]. 9. A. Prayitno, “Early detection study of kidney organ complication caused by diabetes mellitus using iris image color constancy,” in ICTS International Conference on Information, Surabaya, Indonesia, Apr. 2017, pp. 146-149. 10. Z. Goh and K. Griva, “Anxiety and depression in patients with end-stage renal disease: impact and management challenges – a narrative review,” International Journal of Nephrology and Ren- ovascular Disease., vol. 11, pp. 93-102, Nov. 2017. 11. S. Hedayati and F. Finkelstein, “Epidemiology, diagnosis, and management of depression in pa- tients with CKD,” in American Journal of Kidney Diseases, vol. 54, pp. 741-752, July.2009. 12. D. Cukor, J. Coplan, et al. “Depression and anxiety in urban hemodialysis patients,” in Clinical Journal of American Society of Nephrology, vol. 2, pp. 484-490, May.2007. 13. S. Palmer, M. Vecchio, et al. “Prevalence of depression in chronic kidney disease: systematic review and meta-analysis of observational studies,” in Elsevier Kidney International, vol. 84, pp. 179-191, July.2013. 14. F. Murtagh, J. Hall and I. Higginson, “The prevalence of symptoms in end-stage renal disease: a systematic review,” ACKD Advances in Chronic Kidney Disease, vol. 14, pp. 82-99, Jan. 2007. 15. B. V Ravindra and N. Siraam, “Discovery of significant parameters in kidney dialysis data sets by K-means algorithm,” in IEEE International Conference on Circuits, Communication, Control and Computing, Bangalore, India, Nov. 2014, pp. 452-454. 16. M. Ahmad, V. Tundjungsari, D. Widianti, P. Amalia and U. Azizah, “Diagnostic decision sup- port system of chronic kidney disease using support vector machine,” in IEEE Second Interna- tional Conference on Informatics and Computing, Jayapura, Indonesia, Nov. 2017, pp. 1-4. 17. U. Dulhare and M. Ayesha, “Extraction of action rules for chronic kidney disease using Naïve Bayes classifier,” in IEEE International Conference on Computational Intelligence and Compu- ting Research, Chennai, India, May. 2017, pp. 1-4. 18. V. Kunwar, K.Chandel, et al. “Chronic kisney disease analysis using data mining classification techniques,” in IEEE International Conference –Cloud system and Big Data engineering, Noida, India, Jan. 2016, pp. 300-305. 19. Expert System. (2019). What is machine learning? A definition [Online]. Available: https://www.expertsystem.com/machine-learning-definition/ 20. A.Turing, “Computing Machinery and Intelligence,” Mind, vol. 59,pp. 433-460, Oct. 1950. 158