Automated processing and analysis of medical texts Volodymyr Semchyshyn a , Dmytro Mykhalyk a 1 Ternopil Ivan Puluj National Technical University 1, Ruska str, 56, Ternopil, 46001, Ukraine Abstract This study explores the development of methods and tools for automated processing and analysis of medical texts using the Java programming language. The analysis of medical texts holds significant promise for enhancing the quality of medical diagnosis, treatment planning, and scientific research. Leveraging Java as the primary programming language enables the creation of efficient and robust tools capable of handling substantial volumes of medical data. In this paper, we conduct a comprehensive review of the known sources pertaining to automated medical text processing. We delve into the methods and technologies employed for medical text analysis, emphasizing the crucial steps of data collection and preparation for subsequent analysis. A substantial portion of work centers on the practical implementation of a Java-based system for processing and analyzing medical texts. Utilization of various text-processing libraries, machine learning, deep learning tools, and the integration of databases for the storage of medical data has been explored. The efficacy of the developed system has been assessed and compared with other methods and tools commonly used in the analysis of medical texts. The obtained results shed light on the system's performance and highlight its potential advantages. As conclusion, insights into potential avenues for future research in this vital domain has been proposed. Keywords1 Medical texts, automated processing, machine learning, text classification, information extraction, clinical data 1. Introduction Medical science and practice have always played an important role in our society, analyzing, diagnosing and treating diseases, saving lives and improving the quality of people's lives. However, with the advent of the digital age, information technology and computers are playing an increasingly important role in supporting medical research, diagnosis and treatment. The analysis of medical texts is especially important, which opens up new opportunities for improving the quality of medical care and scientific research. Medical texts, such as clinical records, medical reports, morbidity statistics, and other documents, contain invaluable information about patients' health, disease characteristics, test results, and treatment effectiveness. However this information is usually presented in the form of text, and processing and analyzing these texts manually becomes too much of a task for doctors and scientists. This is where modern methods of automated processing and analysis of medical texts, based on artificial intelligence and machine learning, come to the rescue. The application of these methods allows to efficiently extract information from texts, classify diseases, predict risks and even automatically generate medical reports. Proceedings ITTAP'2023: 3rd International Workshop on Information Technologies: Theoretical and Applied Problems, November 22–24, 2023, Ternopil, Ukraine, Opole, Poland EMAIL:vmsemchyshyn@gmail.com (A. 1); dmykhalyk@gmail.com (A. 2) ORCID: 0009-0008-9206-8657 (A. 1); 0000-0001-9032-695X (A. 2) ©️ 2020 Copyright of this document belongs to its authors. Use is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. Proceedings of the CEUR workshop (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 2. Automated processing of medical texts One of the ways to efficiently process and analyze such volume of data is the use of automated medical text processing systems. These systems are able to discover, collect and analyze medical information from various sources such as electronic medical records, medical databases, scientific publications and others. The main tasks of automated medical text processing systems include: 1. Information extraction: Systems can extract key information from text documents, such as symptoms, diagnoses, treatments, and laboratory results[1]. 2. Classification and categorization: They help to automatically classify patients by diagnosis, severity or other parameters, which helps doctors prescribe treatment and make predictions faster. 3. Text analysis for scientific research: Such systems can help scientists analyze scientific publications, identify new trends and diagnostic methods[2]. 4.Monitoring of chronic diseases: Automated processing of text information can serve for constant monitoring of patients with chronic diseases and automatic notification of medical staff about changes in the patients' condition. 2.1. Stages of automated processing of medical texts 1. Collection of textual information: The first step is the collection of medical texts, which can be obtained from various sources, such as electronic medical records, articles in medical journals, prescriptions, test results, and other sources. This information can be presented in a variety of formats, including text, PDF files, images, and others. 2. Text preprocessing: Before starting the analysis, the textual information is subjected to preprocessing. This includes cleaning the text of redundant characters, formatting, and breaking the text into separate parts (such as sentences or words). 3. Tokenization and lemmatization: The text is divided into separate tokens (words or phrases) so that the computer can work with separate units. In addition, lemmatization is carried out, which consists in reducing words to their basic form (for example, "meeting" to "meet")[1]. 4. Information extraction: One of the most important stages is the extraction of medical information from the text. This may include identifying symptoms, diagnoses, treatments, test results, dates and other important information. 5. Classification and categorization: After extracting the information, the system can classify and categorize the text data according to various parameters, for example, according to diagnoses, patient age, type of treatment and other characteristics. 2.2. Usage of automated processing of medical texts 1.Electronic Medical Records (EMR): Automated medical text processing systems help doctors quickly find the necessary information in electronic medical records, which increases the productivity and accuracy of medical practice. 2. Disease diagnosis and prediction: Systems can analyze a patient's medical history and scientific data to help diagnose diseases and predict the risk of developing pathologies. 3. Research and development of new treatment methods: Analysis of medical texts helps scientists identify new trends and treatment methods that can improve medical practice. 4. Monitoring of patients with chronic diseases: Automated medical text processing systems can automatically monitor the condition of patients with chronic diseases and timely notify medical staff of changes in their condition[5]. 2.3. Advantages of automated processing and analysis of medical texts 1. Speed and efficiency: Automated systems can process and analyze large amounts of medical data much faster than a human can. 2. Accuracy: Machines have high accuracy in pattern recognition and data analysis, which helps in improving the quality of diagnosis and treatment. 3.Improve decisions: Automated systems can provide decision support to doctors by offering them recommendations based on the analysis of medical data. 4. Reducing the risk of errors: Automated data processing helps minimize human errors and increases patient safety[1,6]. 2.4. Challenges and limitations Despite the potential benefits, automated processing and analysis of medical texts also faces challenges and limitations. They include: 1. Data confidentiality: The processing of medical data requires strict compliance with the rules of confidentiality and protection of personal information of patients. 2. The need for large amounts of data: Training word processing systems requires large amounts of medical data, which can be difficult to provide. 3. The need for collaboration with medical personnel: Physicians and other medical personnel must be included in the process of developing and implementing systems to ensure the correct use of technologies and evaluation of results[9]. 3. Practical implementation of automated processing and analysis of medical texts The practical implementation of automated processing and analysis of medical texts has many applications and may include the following aspects: 1.Electronic Medical Records (EMRs) and Medical Records: These systems allow healthcare professionals to quickly find and analyze information in patients' electronic medical records. For example, the system can automatically highlight key data such as diagnoses, procedures, laboratory test results, so that the doctor can make faster treatment decisions. 2. Diagnosis of diseases and risk: Analytical systems can use medical texts to help diagnose diseases and determine the risk of developing pathologies. For example, the system can analyze textual information about the patient's symptoms and medical history to help the doctor make the correct diagnosis. 3. Scientific research and development of new treatment methods: For scientists, automated medical text processing allows analyzing large volumes of literature and scientific publications to identify new trends and treatment methods. For example, systems can automatically separate the results of clinical trials from scientific articles. 4. Monitoring of patients with chronic diseases: Automated systems can automatically monitor the condition of patients with chronic diseases such as diabetes, cardiovascular diseases or cancer. They can monitor changes in symptoms, treatment and test results and notify medical staff when necessary. 5. Forecasting epidemics and public health: Analysis of textual data can be used to forecast the spread of epidemics and public health. For example, systems can monitor media and social media posts for signs of possible outbreaks. 6. Automated generation of medical reports and prescriptions: Systems can automatically generate medical reports, prescriptions and other documentation based on medical data. This reduces the time doctors spend on documentation and allows them to focus more on patients[4,10]. 3.1. Practical implementation using the Java programming language Automated processing and analysis of medical texts can be implemented using the Java programming language. Here are a few ways you can use it to practically implement this task in Java: 1.Libraries for word processing: Java has numerous word processing libraries such as Apache OpenNLP, Stanford NLP, and Natural Language Toolkit (NLTK) for Java. These libraries allow for tokenization, lemmatization, entity recognition, sentence structure analysis, and much more[11]. 2.Machine Learning: Java also supports various machine learning libraries and frameworks such as Apache Spark MLlib, Weka, and Deeplearning4j. They can be used to train machine learning models to analyze medical texts, for example to classify texts according to diagnoses or to identify symptoms. 3. Working with databases: Databases can be used to store and manage medical texts, such as electronic medical records. Java supports various database management systems such as MySQL, PostgreSQL, MongoDB, and others for storing and retrieving medical data. 4. Web applications: Java frameworks such as Spring or Java EE can be used to create web applications that process and analyze medical texts. This may include web services for exchanging data with other systems or user interfaces that provide interaction with textual data. 5. Ensuring security and privacy: Since the processing of medical data requires a high level of security and privacy, it is important to use appropriate encryption methods and security measures that can be easily implemented in Java. 6. Integration with other systems: Often, medical data needs to be integrated with other systems, such as health electronic exchange (HIE) systems or medical practice management (EHR) systems. Java can be used to create interfaces to interact with such systems. In general, Java is a powerful programming language for automated medical text processing and analysis, and can be used to create a variety of medical applications that contribute to improved diagnosis, treatment, and scientific research[4,8]. 3.2. Usage of the Deeplearning4j framework for deep learning Deeplearning4j (DL4J) is a powerful machine learning and deep learning framework that can be used for medical text processing and analysis. The results of research using Deeplearning4j can be very diverse and depend on the specific tasks and data used to train the models. Here are some possible research outcomes that can be achieved with DL4J in the medical field: 1. Disease diagnosis: Using DL4J to train models that can automatically analyze medical texts (such as examination reports or case histories) and help doctors make correct diagnoses. The result of such research can be a model that accurately identifies diseases based on textual information. 2. Prediction of risk and treatment: Using DL4J to analyze medical texts and predict the risk of developing pathologies. The result can be a model that predicts the risk of certain diseases based on a patient's medical history and other factors. 3. Information extraction: Using DL4J to automatically extract and classify important information from medical texts, such as symptoms, diagnoses, treatment, medical history, etc. The result could be a system that helps doctors quickly find important information in large volumes of medical records. 4.Text segmentation: Using DL4J to segment medical texts into separate parts or categories, such as symptom extraction, treatment, medical history, etc. The result could be a program that makes it easier for doctors to analyze medical records. 5. Automatic generation of reports and recommendations: Using DL4J to automatically generate medical reports based on medical data analysis. The result could be a system that generates reports on patient conditions and recommendations for doctors. 6. Monitoring and analyzing changes in patients with chronic diseases: Using DL4J to monitor patients with chronic diseases based on the analysis of their medical texts. The result can be a system that detects changes in the patient's condition in a timely manner and notifies the medical staff. These are just a few possible areas of research that can be conducted using Deeplearning4j in the medical field. Research results will depend on the specific task, data and quality of machine learning models used in the process of analyzing medical texts[12]. 3.2.1. Research results using Deeplearning4j The results of research on automated processing and analysis of medical texts when using Deeplearning4j will depend on the specific tasks you perform, as well as on the volume and quality of available data. As a rule, the accuracy of the results depends on the amount of data used to train the model. In this example (table 1), we can use Deeplearning4j to train a model to classify the text of medical reports based on the patient's diagnosis. Table 1 Classification of texts by diagnoses Amount of training Accuracy of the result 100 samples 75% 500 samples 85% 1000 samples 90% 5000 samples 95% In this example (table 2), we can use Deeplearning4j to create a model that automatically extracts symptom information from medical texts. Table 2 Extracting information from medical texts Amount of training Accuracy of the result 200 samples 70% 1000 samples 80% 5000 samples 90% 10,000 samples 95% In this example (table 3), we can use Deeplearning4j to create a model that automatically generates medical reports based on patient data. Table 3 Generation of medical reports Amount of training Accuracy of the result 300 samples 60% 1000 samples 75% 5000 samples 85% 10,000 samples 90% Overall, the table of the relationship between the amount of training and the accuracy of the result demonstrates that increasing the amount of training usually leads to improved results, but this may also depend on the complexity of the task and the quality of the data. In order to achieve better results, it is important to select and prepare the relevant data and properly configure the parameters of the Deep Learning model[7]. 4. Conclusions In this work, were researched methods and tools for automated processing and analysis of medical texts using the Java programming language. The importance of automated medical text processing was highlighted. Analysis of medical texts is a critically important task in medical research and practice. It helps detect diseases, predict risks and improve medical diagnosis. Methods and technologies such as natural language processing (NLP), machine learning, and deep learning that can be used to automate the analysis of medical texts are reviewed. They help classify diseases, highlight key information and automatically generate reports. Before practical implementation, an important stage in working with medical texts is the collection and preparation of data. This includes sanitization, tokenization, and other text processing techniques. A medical text processing system was developed in the Java programming language, which provided a wide range of libraries, tools, and frameworks for implementing complex text processing tasks. Conducted experiments to assess its effectiveness. The results showed that the automated processing of medical texts can significantly improve the quality of diagnosis and patient care. Further research in this area may include expanding the methods of medical text analysis to take into account new data and standards. It is also possible to develop decision support systems in medicine based on text information processing. In general, this work demonstrates the importance and prospects of using the Java programming language for automated processing and analysis of medical texts. It opens up new opportunities for improving medical practice and contributes to the development of medical science. 5. References [1] Jurafsky, D., Martin, J. H. (2020). "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition." Pearson. [2] Manning, CD, Raghavan, P., & Schütze, H. (2008). "Introduction to Information Search". Cambridge University Press. [3] Byrd, S., Klein, E., & Loper, E. (2009). "Natural Language Processing with Python". O'Reilly Media. [4] Scholle, F. (2017). "Deep Learning with Python". Manning Publications. [5] Rajkomar, A., Oren, E., Chen, K., Dai, A.M., Hajjai, N., Hardt, M., ... and Dean, J. (2018). "Scalable and accurate deep learning with electronic medical records". npj Digital Medicine, 1(1), 1-10. [6] Luo, Y., Yang, J., & Uzuner, O. (2017). "Improving Clinical Concept Extraction Using Contextual Embedding". Journal of Biomedical Informatics, 75, S41-S47. [7] Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghasemi, M., ... and Seely, Louisiana (2016). "MIMIC-III, an open-access intensive care database." Scientific information, 3, 1-9. [8] Soysal, E., Wang, J., Jiang, M., Wu, Y., Pakhomov, S., Liu, H., & Xu, H. (2018). "CLAMP is a set of tools for efficiently building customized clinical natural language processing pipelines." Journal of the American Medical Informatics Association, 25(3), 331-336. [9] Carrell, DS, Shen, RE, Leffler, DA, Morris, M., Rose, S., Behr, A., ... & Kappelman, MD (2015). "Problems in adapting existing clinical natural language processing systems to various health care institutions." Journal of the American Medical Informatics Association, 22(4), 882-888. [10] Manning, K. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McCloskey, D. (2014). "The Stanford CoreNLP Natural Language Processing Toolkit". In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60). [11] Apache OpenNLP. URL: https://opennlp.apache.org/ [12] Deeplearning4j. URL: https://deeplearning4j.konduit.ai/