Automated processing and analysis of medical texts
                         Volodymyr Semchyshyn a , Dmytro Mykhalyk a
                         1 Ternopil Ivan Puluj National Technical University 1, Ruska str, 56, Ternopil, 46001, Ukraine


                                         Abstract

                                         This study explores the development of methods and tools for automated processing and
                                         analysis of medical texts using the Java programming language. The analysis of medical texts
                                         holds significant promise for enhancing the quality of medical diagnosis, treatment planning,
                                         and scientific research. Leveraging Java as the primary programming language enables the
                                         creation of efficient and robust tools capable of handling substantial volumes of medical data.
                                         In this paper, we conduct a comprehensive review of the known sources pertaining to
                                         automated medical text processing. We delve into the methods and technologies employed for
                                         medical text analysis, emphasizing the crucial steps of data collection and preparation for
                                         subsequent analysis.
                                         A substantial portion of work centers on the practical implementation of a Java-based system
                                         for processing and analyzing medical texts. Utilization of various text-processing libraries,
                                         machine learning, deep learning tools, and the integration of databases for the storage of
                                         medical data has been explored.
                                         The efficacy of the developed system has been assessed and compared with other methods and
                                         tools commonly used in the analysis of medical texts. The obtained results shed light on the
                                         system's performance and highlight its potential advantages.
                                         As conclusion, insights into potential avenues for future research in this vital domain has been
                                         proposed.

                                         Keywords1
                                         Medical texts, automated processing, machine learning, text classification, information
                                         extraction, clinical data


                         1. Introduction
                             Medical science and practice have always played an important role in our society, analyzing,
                         diagnosing and treating diseases, saving lives and improving the quality of people's lives. However,
                         with the advent of the digital age, information technology and computers are playing an increasingly
                         important role in supporting medical research, diagnosis and treatment. The analysis of medical texts
                         is especially important, which opens up new opportunities for improving the quality of medical care
                         and scientific research. Medical texts, such as clinical records, medical reports, morbidity statistics, and
                         other documents, contain invaluable information about patients' health, disease characteristics, test
                         results, and treatment effectiveness. However this information is usually presented in the form of text,
                         and processing and analyzing these texts manually becomes too much of a task for doctors and
                         scientists. This is where modern methods of automated processing and analysis of medical texts, based
                         on artificial intelligence and machine learning, come to the rescue. The application of these methods
                         allows to efficiently extract information from texts, classify diseases, predict risks and even
                         automatically generate medical reports.


                         Proceedings ITTAP'2023: 3rd International Workshop on Information Technologies: Theoretical and Applied Problems, November 22–24,
                         2023, Ternopil, Ukraine, Opole, Poland
                         EMAIL:vmsemchyshyn@gmail.com (A. 1); dmykhalyk@gmail.com (A. 2)
                         ORCID: 0009-0008-9206-8657 (A. 1); 0000-0001-9032-695X (A. 2)
                                      ©️ 2020 Copyright of this document belongs to its authors.
                                      Use is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
                                      Proceedings of the CEUR workshop (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
2. Automated processing of medical texts
    One of the ways to efficiently process and analyze such volume of data is the use of automated
medical text processing systems. These systems are able to discover, collect and analyze medical
information from various sources such as electronic medical records, medical databases, scientific
publications and others.
    The main tasks of automated medical text processing systems include:
    1. Information extraction: Systems can extract key information from text documents, such as
symptoms, diagnoses, treatments, and laboratory results[1].
    2. Classification and categorization: They help to automatically classify patients by diagnosis,
severity or other parameters, which helps doctors prescribe treatment and make predictions faster.
    3. Text analysis for scientific research: Such systems can help scientists analyze scientific
publications, identify new trends and diagnostic methods[2].
    4.Monitoring of chronic diseases: Automated processing of text information can serve for constant
monitoring of patients with chronic diseases and automatic notification of medical staff about changes
in the patients' condition.


2.1.    Stages of automated processing of medical texts
    1. Collection of textual information: The first step is the collection of medical texts, which can be
obtained from various sources, such as electronic medical records, articles in medical journals,
prescriptions, test results, and other sources. This information can be presented in a variety of formats,
including text, PDF files, images, and others.
    2. Text preprocessing: Before starting the analysis, the textual information is subjected to
preprocessing. This includes cleaning the text of redundant characters, formatting, and breaking the text
into separate parts (such as sentences or words).
    3. Tokenization and lemmatization: The text is divided into separate tokens (words or phrases) so
that the computer can work with separate units. In addition, lemmatization is carried out, which consists
in reducing words to their basic form (for example, "meeting" to "meet")[1].
    4. Information extraction: One of the most important stages is the extraction of medical information
from the text. This may include identifying symptoms, diagnoses, treatments, test results, dates and
other important information.
    5. Classification and categorization: After extracting the information, the system can classify and
categorize the text data according to various parameters, for example, according to diagnoses, patient
age, type of treatment and other characteristics.


2.2.    Usage of automated processing of medical texts
   1.Electronic Medical Records (EMR): Automated medical text processing systems help doctors
quickly find the necessary information in electronic medical records, which increases the productivity
and accuracy of medical practice.
   2. Disease diagnosis and prediction: Systems can analyze a patient's medical history and scientific
data to help diagnose diseases and predict the risk of developing pathologies.
   3. Research and development of new treatment methods: Analysis of medical texts helps scientists
identify new trends and treatment methods that can improve medical practice.
   4. Monitoring of patients with chronic diseases: Automated medical text processing systems can
automatically monitor the condition of patients with chronic diseases and timely notify medical staff of
changes in their condition[5].


2.3.    Advantages of automated processing and analysis of medical texts
   1. Speed and efficiency: Automated systems can process and analyze large amounts of medical data
much faster than a human can.
   2. Accuracy: Machines have high accuracy in pattern recognition and data analysis, which helps in
improving the quality of diagnosis and treatment.
   3.Improve decisions: Automated systems can provide decision support to doctors by offering them
recommendations based on the analysis of medical data.
   4. Reducing the risk of errors: Automated data processing helps minimize human errors and
increases patient safety[1,6].


2.4.    Challenges and limitations
   Despite the potential benefits, automated processing and analysis of medical texts also faces
challenges and limitations. They include:
   1. Data confidentiality: The processing of medical data requires strict compliance with the rules of
confidentiality and protection of personal information of patients.
   2. The need for large amounts of data: Training word processing systems requires large amounts of
medical data, which can be difficult to provide.
   3. The need for collaboration with medical personnel: Physicians and other medical personnel must
be included in the process of developing and implementing systems to ensure the correct use of
technologies and evaluation of results[9].


3. Practical implementation of automated processing and analysis of medical
   texts
    The practical implementation of automated processing and analysis of medical texts has many
applications and may include the following aspects:
    1.Electronic Medical Records (EMRs) and Medical Records: These systems allow healthcare
professionals to quickly find and analyze information in patients' electronic medical records. For
example, the system can automatically highlight key data such as diagnoses, procedures, laboratory test
results, so that the doctor can make faster treatment decisions.
    2. Diagnosis of diseases and risk: Analytical systems can use medical texts to help diagnose diseases
and determine the risk of developing pathologies. For example, the system can analyze textual
information about the patient's symptoms and medical history to help the doctor make the correct
diagnosis.
    3. Scientific research and development of new treatment methods: For scientists, automated medical
text processing allows analyzing large volumes of literature and scientific publications to identify new
trends and treatment methods. For example, systems can automatically separate the results of clinical
trials from scientific articles.
    4. Monitoring of patients with chronic diseases: Automated systems can automatically monitor the
condition of patients with chronic diseases such as diabetes, cardiovascular diseases or cancer. They
can monitor changes in symptoms, treatment and test results and notify medical staff when necessary.
    5. Forecasting epidemics and public health: Analysis of textual data can be used to forecast the
spread of epidemics and public health. For example, systems can monitor media and social media posts
for signs of possible outbreaks.
    6. Automated generation of medical reports and prescriptions: Systems can automatically generate
medical reports, prescriptions and other documentation based on medical data. This reduces the time
doctors spend on documentation and allows them to focus more on patients[4,10].


3.1.    Practical implementation using the Java programming language
    Automated processing and analysis of medical texts can be implemented using the Java
programming language. Here are a few ways you can use it to practically implement this task in Java:
    1.Libraries for word processing: Java has numerous word processing libraries such as Apache
OpenNLP, Stanford NLP, and Natural Language Toolkit (NLTK) for Java. These libraries allow for
tokenization, lemmatization, entity recognition, sentence structure analysis, and much more[11].
    2.Machine Learning: Java also supports various machine learning libraries and frameworks such as
Apache Spark MLlib, Weka, and Deeplearning4j. They can be used to train machine learning models
to analyze medical texts, for example to classify texts according to diagnoses or to identify symptoms.
    3. Working with databases: Databases can be used to store and manage medical texts, such as
electronic medical records. Java supports various database management systems such as MySQL,
PostgreSQL, MongoDB, and others for storing and retrieving medical data.
    4. Web applications: Java frameworks such as Spring or Java EE can be used to create web
applications that process and analyze medical texts. This may include web services for exchanging data
with other systems or user interfaces that provide interaction with textual data.
    5. Ensuring security and privacy: Since the processing of medical data requires a high level of
security and privacy, it is important to use appropriate encryption methods and security measures that
can be easily implemented in Java.
    6. Integration with other systems: Often, medical data needs to be integrated with other systems,
such as health electronic exchange (HIE) systems or medical practice management (EHR) systems. Java
can be used to create interfaces to interact with such systems.
    In general, Java is a powerful programming language for automated medical text processing and
analysis, and can be used to create a variety of medical applications that contribute to improved
diagnosis, treatment, and scientific research[4,8].


3.2.    Usage of the Deeplearning4j framework for deep learning

   Deeplearning4j (DL4J) is a powerful machine learning and deep learning framework that can be
used for medical text processing and analysis. The results of research using Deeplearning4j can be very
diverse and depend on the specific tasks and data used to train the models. Here are some possible
research outcomes that can be achieved with DL4J in the medical field:
   1. Disease diagnosis: Using DL4J to train models that can automatically analyze medical texts (such
as examination reports or case histories) and help doctors make correct diagnoses. The result of such
research can be a model that accurately identifies diseases based on textual information.
   2. Prediction of risk and treatment: Using DL4J to analyze medical texts and predict the risk of
developing pathologies. The result can be a model that predicts the risk of certain diseases based on a
patient's medical history and other factors.
   3. Information extraction: Using DL4J to automatically extract and classify important information
from medical texts, such as symptoms, diagnoses, treatment, medical history, etc. The result could be a
system that helps doctors quickly find important information in large volumes of medical records.
   4.Text segmentation: Using DL4J to segment medical texts into separate parts or categories, such as
symptom extraction, treatment, medical history, etc. The result could be a program that makes it easier
for doctors to analyze medical records.
   5. Automatic generation of reports and recommendations: Using DL4J to automatically generate
medical reports based on medical data analysis. The result could be a system that generates reports on
patient conditions and recommendations for doctors.
   6. Monitoring and analyzing changes in patients with chronic diseases: Using DL4J to monitor
patients with chronic diseases based on the analysis of their medical texts. The result can be a system
that detects changes in the patient's condition in a timely manner and notifies the medical staff.
   These are just a few possible areas of research that can be conducted using Deeplearning4j in the
medical field. Research results will depend on the specific task, data and quality of machine learning
models used in the process of analyzing medical texts[12].


3.2.1. Research results using Deeplearning4j
   The results of research on automated processing and analysis of medical texts when using
Deeplearning4j will depend on the specific tasks you perform, as well as on the volume and quality of
available data. As a rule, the accuracy of the results depends on the amount of data used to train the
model.
   In this example (table 1), we can use Deeplearning4j to train a model to classify the text of medical
reports based on the patient's diagnosis.

Table 1
Classification of texts by diagnoses
                 Amount of training                        Accuracy of the result
                    100 samples                                      75%
                    500 samples                                      85%
                   1000 samples                                      90%
                   5000 samples                                      95%

   In this example (table 2), we can use Deeplearning4j to create a model that automatically extracts
symptom information from medical texts.

Table 2
Extracting information from medical texts
                Amount of training                         Accuracy of the result
                   200 samples                                     70%
                  1000 samples                                     80%
                  5000 samples                                     90%
                 10,000 samples                                    95%

  In this example (table 3), we can use Deeplearning4j to create a model that automatically generates
medical reports based on patient data.

Table 3
Generation of medical reports
              Amount of training                           Accuracy of the result
                    300 samples                                      60%
                   1000 samples                                      75%
                   5000 samples                                      85%
                  10,000 samples                                     90%


   Overall, the table of the relationship between the amount of training and the accuracy of the result
demonstrates that increasing the amount of training usually leads to improved results, but this may also
depend on the complexity of the task and the quality of the data. In order to achieve better results, it is
important to select and prepare the relevant data and properly configure the parameters of the Deep
Learning model[7].


4. Conclusions
   In this work, were researched methods and tools for automated processing and analysis of medical
texts using the Java programming language. The importance of automated medical text processing was
highlighted. Analysis of medical texts is a critically important task in medical research and practice. It
helps detect diseases, predict risks and improve medical diagnosis. Methods and technologies such as
natural language processing (NLP), machine learning, and deep learning that can be used to automate
the analysis of medical texts are reviewed. They help classify diseases, highlight key information and
automatically generate reports. Before practical implementation, an important stage in working with
medical texts is the collection and preparation of data. This includes sanitization, tokenization, and other
text processing techniques.
   A medical text processing system was developed in the Java programming language, which provided
a wide range of libraries, tools, and frameworks for implementing complex text processing tasks.
Conducted experiments to assess its effectiveness. The results showed that the automated processing of
medical texts can significantly improve the quality of diagnosis and patient care.
   Further research in this area may include expanding the methods of medical text analysis to take into
account new data and standards. It is also possible to develop decision support systems in medicine
based on text information processing.
   In general, this work demonstrates the importance and prospects of using the Java programming
language for automated processing and analysis of medical texts. It opens up new opportunities for
improving medical practice and contributes to the development of medical science.


5. References

[1] Jurafsky, D., Martin, J. H. (2020). "Speech and Language Processing: An Introduction to Natural
     Language Processing, Computational Linguistics, and Speech Recognition." Pearson.
[2] Manning, CD, Raghavan, P., & Schütze, H. (2008). "Introduction to Information Search".
     Cambridge University Press.
[3] Byrd, S., Klein, E., & Loper, E. (2009). "Natural Language Processing with Python". O'Reilly
     Media.
[4] Scholle, F. (2017). "Deep Learning with Python". Manning Publications.
[5] Rajkomar, A., Oren, E., Chen, K., Dai, A.M., Hajjai, N., Hardt, M., ... and Dean, J. (2018).
     "Scalable and accurate deep learning with electronic medical records". npj Digital Medicine, 1(1),
     1-10.
[6] Luo, Y., Yang, J., & Uzuner, O. (2017). "Improving Clinical Concept Extraction Using Contextual
     Embedding". Journal of Biomedical Informatics, 75, S41-S47.
[7] Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghasemi, M., ... and Seely,
     Louisiana (2016). "MIMIC-III, an open-access intensive care database." Scientific information, 3,
     1-9.
[8] Soysal, E., Wang, J., Jiang, M., Wu, Y., Pakhomov, S., Liu, H., & Xu, H. (2018). "CLAMP is a
     set of tools for efficiently building customized clinical natural language processing pipelines."
     Journal of the American Medical Informatics Association, 25(3), 331-336.
[9] Carrell, DS, Shen, RE, Leffler, DA, Morris, M., Rose, S., Behr, A., ... & Kappelman, MD (2015).
     "Problems in adapting existing clinical natural language processing systems to various health care
     institutions." Journal of the American Medical Informatics Association, 22(4), 882-888.
[10] Manning, K. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McCloskey, D. (2014). "The
     Stanford CoreNLP Natural Language Processing Toolkit". In Proceedings of the 52nd Annual
     Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
[11] Apache OpenNLP. URL: https://opennlp.apache.org/
[12] Deeplearning4j. URL: https://deeplearning4j.konduit.ai/