Methodology for Preprocessing Semi-Structured Data for
   Making Managerial Decisions in Healthcare*
        Elena Makarova[0000-0002-5410-5890] and Dmitriy Lagerev[0000-0002-2702-6492]

                    Bryansk State Technical University, Bryansk, Russia
                  lennymbear@gmail.com, LagerevDG@mail.ru


       Abstract. This paper describes the process of supporting managerial decision-making
       in healthcare based on data mining. The authors describe the problems and specifics
       of data in medical information systems that complicate their analysis and
       integration: a large number of domain-specific abbreviations, errors in the data,
       and poor structure. The paper demonstrates an approach to finding and expanding
       abbreviations in texts that combines machine and human processing. A method is
       proposed for extracting features from semi-structured fields with the help of a
       subject-area expert and various visualizations. The proposed abbreviation search and
       expansion methods, based on a hybrid approach that combines the strengths of machine
       and expert processing, can increase the number of abbreviations found automatically
       and significantly reduce the time experts spend on processing the remaining
       abbreviations. In addition, the method for automated feature extraction during
       integration can significantly increase the amount of useful input data while
       reducing the expert's time.

       Keywords: Natural Language Processing, Data Integration, Healthcare.


1      Introduction

The digitalization of Russian medicine poses new challenges for managers and engineers:
implementation, security and support issues, and large data storage and processing
systems. At the same time, collecting these data in digital form opens up new
opportunities for researchers and healthcare managers through the use of data analysis
technologies.
   Over the years of informatization of various cities and regions of the Russian
Federation, medical information systems (hereinafter referred to as MIS) have
accumulated more and more data on various aspects of the work of medical organizations,
from the medical histories and prescriptions of specific patients to the provision of
medical institutions with the necessary medicines and supplies.


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

* The reported study was funded by RFBR, project № 20-04-60185.

    Thus, the improvement of analysis technologies and the filling of databases with
medical data make it relevant to use data mining for tasks related to healthcare
management [1], such as planning material and human resources in healthcare, forecasting
statistical indicators, providing resources optimally and on time, and tracing the
outbreak and spread of diseases.
    These processes explain the relevance of creating an automated system to support
managerial decision-making in healthcare and of solving the side problems associated
with its implementation.
    The emphasis is on the need to support both strategic decisions (development of the
healthcare sector in the region) and operational decisions (decisions at the level of a
medical organization, response to outbreaks of diseases, etc.).


2      Management decision-making in healthcare

The forecasting task is highly relevant for healthcare: it is necessary to predict the
incidence rate, assess the resources required to keep the system operating effectively,
etc. Researchers devote much attention to the problems of disease prognosis in specific
regions [2].
   Forecasting can be approached in many ways, from classical statistical methods to
models based on machine learning. Recently, the neural network approach has become
widespread [3]. For example, recent research shows that the accuracy of predicting many
diseases using CNN is greater than that of the KNN and NB algorithms [4].
   In addition, a study that used deep learning to process textual medical data (with
various word embedding models) showed an increase in forecasting accuracy [5]. For a
number of diseases (for example, injuries, SARS, etc.), classic time series models that
use only numerical values work best. However, for some groups of diseases it is
necessary to apply more complex approaches to the analysis. For example, when predicting
malignant neoplasms, it is necessary to understand not only the general characteristics,
but also the number of patients at different stages of cancer. In a study on the
prediction of breast cancer using machine learning, patients were divided into cohorts
depending on the stage of the disease and other parameters, which allowed better
identification of factors contributing to patient survival [6].
   To decide on the distribution of resources between medical institutions, the regional
health department needs to forecast the development of certain diseases and act
according to long-term planning. The process of creating such a forecast can be
simplified by developing an automated system to support management decisions (Fig. 1).
   Integrating and setting up this process requires an investment of human resources.
However, given the need to make such management decisions regularly and the constant
updating of data in the regional information system (RIS), these labor costs will be
justified in the long term. The general scheme of this process is shown in Figure 2.

                   Fig. 1. The management decision-making in healthcare

  Health resource management conceptual model:

                                S = < R, M, D, Z; I>,                                  (1)

   where: R - available resources (budget for all medical institutions);
   M - budget of a specific medical organization (MO) by item (equipment, maintenance
of an inpatient hospital, procurement of medicines, salary rates of health workers,
etc.);
   D - effectiveness with which this budget is used;
   Z - requests for resources (the current needs of the various MOs);
   I - information available for analysis and for forecasting resource requirements.
   The most time-consuming step in data mining is still collecting, cleaning and
pre-processing the data before analysis. According to various researchers, this process
takes from 60% to 80% of the time [7].
   In the authors' previous work [8] on collecting and processing semi-structured data,
much attention was paid to the “hybrid” approach to developing data analysis systems. In
this approach, human expertise is used in conjunction with automatic analysis methods.
On the one hand, this improves the quality of the system on tasks that cannot be solved
without human intervention; on the other, it relieves the expert of typical, routine
tasks. This was achieved using various methodologies for data pre-processing and
visualization, which helped the expert decide faster whether to include or exclude a
particular data source [8].
   Data collection is only the first step in preparing data for use in ensembles of data
mining models. In addition to the preprocessing problems common to all subject areas,
researchers and developers who build analytical models for biomedical data encounter a
number of problems specific to such data, which are discussed in more detail in the next
section.

          Fig. 2. The general scheme of management decision-making in healthcare


3      Pre-processing of medical data for analysis

The most common methods for implementing data integration are:
        1) file-based sharing;
        2) data replication;
        3) Web services technology;
        4) Service Oriented Architecture (SOA);
        5) integration servers [9].
   In the USISH project (Unified State Information System in Healthcare), a
document-based approach is proposed as the main method of integration between different
RIS [10]. Since various medical documents are poorly structured, we are faced with the
problem of correlating data from different sources. For example, the set of “general
patient indicators” (weight, height, blood pressure, blood glucose, etc.) may differ
depending on the medical institution. Resolving such ambiguities is also an important
part of solving the integration problem. For example, the same indicator is called
“height” in one database and “body length” in another.
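One simple way to resolve such naming differences is a synonym table that maps each source-specific field name to a canonical one. The following is a minimal sketch only: the canonical names (`height_cm`, `weight_kg`), the synonym lists and the function `harmonize` are illustrative assumptions, not the mapping used in the described system.

```python
# Hypothetical synonym table: canonical indicator name -> names seen in sources.
FIELD_SYNONYMS = {
    "height_cm": {"height", "body length"},
    "weight_kg": {"weight", "body mass"},
}

# Invert the table once so that lookup by source field name is a dict access.
CANONICAL_BY_ALIAS = {
    alias: canonical
    for canonical, aliases in FIELD_SYNONYMS.items()
    for alias in aliases
}


def harmonize(record):
    """Rename known fields to canonical names; keep unknown fields unchanged."""
    result = {}
    for name, value in record.items():
        key = CANONICAL_BY_ALIAS.get(name.strip().lower(), name)
        result[key] = value
    return result
```

Fields that are not in the table pass through untouched, so an expert can review them and extend the table incrementally.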
   A further source of poorly structured data is that the ICD-10 classification does not
contain details of the diagnosis: doctors must enter this information themselves, either
in forms predefined in the system or as free text.
   The free-text field usually also contains the degree (stage) of the disease, its
form, the prescribed medications, and so on. For example, the ICD-10 classification of
malignant neoplasms does not reflect the stage of the disease; it is usually given in
natural language in another field. However, having these data would make it possible to
build a better prognostic model of the stages of the disease. For example, patients
with a first-stage malignant neoplasm are, under certain circumstances, at significant
risk of progressing to the second stage, second-stage patients of progressing to the
third, and so on.
    Here is an example of such uncertainty in an ICD-10 diagnosis. The ICD-10 code
“S82.6” (fracture of lateral malleolus) needs at least an indication of whether the
right or the left leg is affected (“left” can be expressed in free form as «левая»,
«слева», «левый», «лев», «л.», etc.), as well as an indication of whether the fracture
is closed or not, complete or incomplete.
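The free-form variants listed above can be normalized with a small set of patterns. The sketch below is an illustration under assumptions: the patterns for the left side follow the variants quoted in the text, while the right-side patterns and the function name `extract_side` are hypothetical additions.

```python
import re

# Left-side variants from the text; right-side variants are assumed by analogy.
# In Python 3, \b is Unicode-aware, so it works for Cyrillic words.
LEFT = re.compile(r"\b(?:лев(?:ая|ый|ое|ой)?\b|слева\b|л\.)", re.IGNORECASE)
RIGHT = re.compile(r"\b(?:прав(?:ая|ый|ое|ой)?\b|справа\b|пр\.)", re.IGNORECASE)


def extract_side(text):
    """Return 'left', 'right', or None when the side is absent or ambiguous."""
    has_left = LEFT.search(text) is not None
    has_right = RIGHT.search(text) is not None
    if has_left == has_right:  # neither side or both sides mentioned
        return None
    return "left" if has_left else "right"
```

Records for which the function returns `None` are natural candidates for expert review rather than automatic labeling.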
   Table 1 presents some examples of how certain important terms are written. In
addition, in some cases it is not clear whether a token is an abbreviation, a specific
term, or a misspelled word.
   The database also contains specific grammatical constructions that make it difficult
to extract features. For example, if symptoms are identified by standard methods in the
phrase “did not have hepatitis, tuberculosis” from a patient's history, hepatitis and
tuberculosis could enter the patient's model without the negation, which would worsen
the quality of the prediction models.
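To illustrate the negation problem, the sketch below marks a term as negated when its sentence contains a negation cue. It is a deliberately crude example on English text: the term list, the cue list and the sentence-level scope rule are all assumptions and do not describe the method used by the authors.

```python
import re

# Hypothetical term list and negation cues, for illustration only.
TERMS = ("hepatitis", "tuberculosis", "diabetes")
NEGATION_CUES = re.compile(r"\b(?:did not have|no|denies|without)\b", re.IGNORECASE)


def extract_conditions(history):
    """Map each mentioned term to True (affirmed) or False (negated).

    A term counts as negated when its sentence contains a negation cue.
    """
    found = {}
    for sentence in re.split(r"[.;]\s*", history):
        negated = NEGATION_CUES.search(sentence) is not None
        lowered = sentence.lower()
        for term in TERMS:
            if term in lowered:
                found[term] = not negated
    return found
```

A plain keyword search over the same text would report all three conditions as present, which is exactly the error described above.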

                                 Table 1. Examples of data

 Abbreviation           Whole word                       Close tokens
 “отр”                  отрицательный (negative)         “отриц”
 “хр”                   хронический (chronic)            “хрон”, “хроничекий”
 “бер”                  беременность (pregnancy)         “бер-ть”, “бер-сть”, “берем”
 “нед”                  неделя (week)                    “ндл”, “неделль”


4        A hybrid approach to finding and expanding abbreviations
         and misspelled words

Based on the available data, various approaches to finding abbreviations were tried,
starting with a standard approach based on regular expressions and a dictionary of
commonly used abbreviations. Since many abbreviations in the sample are specific,
combined methods based on heuristic, dictionary, and statistical approaches gave the
greatest increase in accuracy. This approach and the results of its use are described
in detail in a previous work of the authors [12].
    However, it is not yet possible to expand specific abbreviations with absolute
precision in a fully automated mode; solving this problem still requires the knowledge
of a competent person. The degree of expert intervention can be significantly reduced
by analyzing the context of abbreviations: a word embedding model is trained on the
available data and predicts the meaning of an abbreviation from its context. The
available data are sufficient for the model to establish syntagmatic and paradigmatic
relationships.
    The semantic similarity between linguistic units is calculated as the distance
between their vectors. Studies on distributional semantics most often use the cosine
measure, which is calculated by the formula
\[
\mathrm{sim} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}, \tag{2}
\]

   where A and B are the vectors of the two words being compared and n is their
dimension.
   In this study, vectorization was implemented using the Bag of Words method [13], but
other methods are also possible. For this sample, a similarity of 0.7 was taken as a
sufficiently high threshold:

                                      sim ≥ 0.7                                         (3)
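The cosine measure (2) and the threshold (3) can be sketched as follows. This is an assumption-laden illustration: the context window of two words, whitespace tokenization and the function names `context_vector` and `cosine` are hypothetical choices, not details taken from the study.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.7  # threshold (3)


def context_vector(token, sentences, window=2):
    """Bag-of-words counts of the words within `window` positions of `token`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == token:
                neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts


def cosine(a, b):
    """Cosine measure (2) for two sparse count vectors stored as mappings."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two tokens that occur in identical contexts, such as an abbreviation and its full form in otherwise identical phrases, obtain a similarity close to 1 and therefore pass the threshold.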

   In addition, for an abbreviation to be expanded automatically to the word that fits
its context, three conditions must hold:
        1) the abbreviation coincides with the beginning of a semantically close word;
        2) the found word is not itself an abbreviation and is present in the dictionary
of Russian words in use;
        3) the found word is the only one satisfying the first and second conditions
within the semantic proximity range from 0.7 to 0.99.
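The three conditions above can be expressed as a single filter over the semantically close words. The sketch is hedged: the function name `auto_expand`, its parameters and the way the neighbours, the dictionary and the list of known abbreviations are supplied are assumptions made for illustration.

```python
def auto_expand(abbr, neighbours, dictionary, known_abbreviations,
                low=0.7, high=0.99):
    """Return the unique expansion of `abbr`, or None if an expert is needed.

    neighbours: iterable of (word, similarity) pairs from the semantic model.
    dictionary: set of valid dictionary words.
    known_abbreviations: set of tokens already identified as abbreviations.
    """
    candidates = [
        word
        for word, similarity in neighbours
        if low <= similarity <= high              # within the proximity range
        and word != abbr
        and word.startswith(abbr)                 # condition 1
        and word not in known_abbreviations       # condition 2: not an abbreviation
        and word in dictionary                    # condition 2: dictionary word
    ]
    # Condition 3: exactly one candidate may remain.
    return candidates[0] if len(candidates) == 1 else None
```

When the function returns `None`, the abbreviation is left for the expert, which matches the hybrid scheme described in this section.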
   For words that do not begin with the abbreviation but are close in terms of the
cosine measure, the Tanimoto coefficient is used, with a match value greater than or
equal to 0.5 [14]. For example, we calculate the syntactic proximity of the abbreviation
“стд” and the word “стадия” (stage):
\[
k = \frac{c}{a + b - c}, \tag{4}
\]

   where a and b are the numbers of elements in the “стд” token and the “стадия” token,
respectively;
   c is the number of elements common to the “стд” and “стадия” tokens.
   In the standard setting, k is required to be at least 0.5 for words with a
non-matching beginning and ending (“стд” and “стадия”) and at least 0.35 for words with
a matching beginning and ending that contain a hyphen (“бр-ть” and “беременность”). A
word embedding visualization technique for analysts working on these tasks was
presented in a previous work of the authors and was also used in this case.
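The Tanimoto coefficient (4) and the two thresholds can be sketched as below. The text does not specify what counts as an "element" of a token; this sketch assumes the set of distinct characters, which is one possible reading. The function names are hypothetical.

```python
def tanimoto(token_a, token_b):
    """Tanimoto coefficient (4) over the sets of distinct characters."""
    chars_a, chars_b = set(token_a), set(token_b)
    a, b = len(chars_a), len(chars_b)
    c = len(chars_a & chars_b)
    denominator = a + b - c
    return c / denominator if denominator else 0.0


def is_plausible_expansion(abbr, word):
    """Apply 0.35 to hyphenated abbreviations whose first and last characters
    match the word, and 0.5 otherwise."""
    if not abbr or not word:
        return False
    relaxed = "-" in abbr and abbr[0] == word[0] and abbr[-1] == word[-1]
    threshold = 0.35 if relaxed else 0.5
    return tanimoto(abbr, word) >= threshold
```

Under this reading, “стд” and “стадия” share 3 characters out of 3 and 6 distinct ones, so k = 3 / (3 + 6 - 3) = 0.5, which just meets the threshold.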
   From the available set of depersonalized records from the integrated electronic
medical records (IEMR) of approximately 1.4 million residents of the Bryansk region, a
sample of 60,000 records was created, balanced by diagnosis and length of text
description, from which 3,000 records were selected in the same way. Each abbreviation
was expanded manually to verify the developed methodology. Next, the results of a fully
manual approach, a fully automated approach, and the hybrid approach described above
were compared. The results are presented in Table 2.

                        Table 2. Search and disclosure of abbreviations

 Approach                    Processing time for         Share of abbreviations found
                             60,000 records              and expanded
 Fully manual filling        About 410 hours             Close to 100%
 Fully automated approach    5 to 10 minutes*            Up to 53%
 Hybrid approach             20 to 67 minutes*           83-90%

The interface used by a subject-area specialist to label data under manual (expert)
control is presented in Figure 3. In the proposed approach, the expert checks not only
all cases that fall outside the thresholds for automatic labeling, but also 5-10% of the
instances classified automatically by the system, in order to verify that the algorithms
work correctly and to adjust their settings if necessary. The large spread in the
expert's labor costs for labeling is explained by their strong dependence on the number
of checked examples and on the strictness of the thresholds for automatic labeling. In
the described experiment, the share of data reviewed by experts versus trusted to the
system depends on the accuracy requirements and is limited by the available resources.
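The review policy described above can be sketched as a routing function: low-confidence items go to the expert, and a random share of the automatically labeled items is audited. The function name, the confidence scores and the default values (threshold 0.7, audit share 5%) are assumptions for illustration.

```python
import random


def route_for_review(items, auto_threshold=0.7, audit_share=0.05, seed=0):
    """Split (item, confidence) pairs into three lists.

    Returns (auto, review, audit): items labeled automatically, items sent to
    the expert, and the random subset of `auto` that the expert also checks.
    """
    rng = random.Random(seed)
    auto, review = [], []
    for item, confidence in items:
        if confidence >= auto_threshold:
            auto.append(item)
        else:
            review.append(item)
    # Audit at least one automatically labeled item when any exist.
    audit_size = min(len(auto), max(1, round(len(auto) * audit_share))) if auto else 0
    audit = rng.sample(auto, audit_size)
    return auto, review, audit
```

Raising `auto_threshold` or `audit_share` increases the expert's workload, which is the trade-off behind the spread in processing time reported for the hybrid approach.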


          Fig. 3. The interface of a specialist in the subject area for marking up data


5      Visual interface for data extraction

Reducing various abbreviations with the same meaning to a single value is only one way
to reduce the number and improve the quality of the features that are input to
analytical models. Using a large number of features, some of which are duplicates and
some useless, is irrational: it leads to problems such as overfitting, increased
processing time, and the presence of "noise" and "garbage data".
   In addition to selecting the features that will go into the model, features often
need to be extracted from poorly structured data. In this case, we also use a hybrid
approach (Fig. 4).
   This approach uses word2vec models (to select contextually close tokens) and a visual
editor (Fig. 5). The main visualization metric here is the coverage of features. It is
calculated on a weighted limited sample (in this case, 3,000 records) so that the
results can be recalculated quickly and the visualizations remain interactive.
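The text does not define the coverage metric formally. One plausible reading, assumed here, is the share of sampled records from which a rule extracts at least one value. The stage pattern `STAGE_RULE` and the function `coverage` are hypothetical examples.

```python
import re

# Hypothetical rule for the cancer stage written in free text,
# e.g. "стадия 2" or "ст. II".
STAGE_RULE = re.compile(r"\b(?:стадия|стд|ст)\.?\s*(I{1,3}|IV|[1-4])\b", re.IGNORECASE)


def coverage(rule, records):
    """Share of records in which the rule extracts at least one value."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for text in records if rule.search(text))
    return hits / len(records)
```

Because the metric is a single pass over a few thousand short texts, it can be recomputed after every rule edit, which is what keeps the visual editor interactive.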
   The main extraction method is a set of rules based on the principles of regular
expressions. A user who is an expert in the subject area but is not familiar with
regular expressions and text processing is provided with a visual editor for these
expressions and instructions for its use. After recalculation, several random records
are also presented for manual evaluation of the created rules. The results of these
evaluations are saved for automatic validation when the rules change.
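Saved expert evaluations can serve as regression tests for the rules. The sketch below assumes that each evaluation is stored as a pair of record text and the value the expert confirmed; this storage format and the function name `validate_rule` are illustrative assumptions.

```python
import re


def validate_rule(rule, saved_evaluations):
    """Return the saved examples on which the rule disagrees with the expert.

    saved_evaluations: iterable of (text, expected) pairs, where `expected`
    is the value confirmed by the expert, or None if nothing should be found.
    The rule must expose the extracted value as its first capture group.
    """
    failures = []
    for text, expected in saved_evaluations:
        match = rule.search(text)
        extracted = match.group(1) if match else None
        if extracted != expected:
            failures.append((text, expected, extracted))
    return failures
```

An empty result means that the changed rule still agrees with every stored expert judgment; a non-empty one lists the records to show the expert again.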
   Table 3 presents the results of an experiment conducted on a sample of oncological
diagnoses, where important metadata, such as the "stage", were described in the free
entry fields.


                       Fig. 4. Extraction of features: hybrid approach


                        Table 3. Feature extraction in database integration.

 Approach                    Processing time for         Share of extracted features
                             9,000 records
 Fully manual filling        About 18 hours              Close to 100%
 Fully automated approach    5 to 10 minutes*            About 40%
 Hybrid approach             30 to 120 minutes*          Up to 90%


6       Conclusions

To manage healthcare resources effectively, it is necessary to collect, store and
analyze data received from all regions of the Russian Federation. At the moment, one of
the main methods of data integration in the USISH project is integration through
documents. Since documents are a poorly structured source of information, such
integration raises problems of interpreting various text data, as well as problems of
data quality: the presence of specific abbreviations, errors, difficulties in extracting
various features, etc. A large amount of noise, duplicates, and incorrect features
degrades the quality of data analysis models.
   The proposed abbreviation search and expansion methods, based on a hybrid approach
that combines the strengths of machine and expert processing, can increase the number of
abbreviations found automatically by 21%, detect up to 55% of cases in automated mode
(with a probability of correctness higher than 70%), and significantly reduce the time
experts spend on processing the remaining abbreviations.
   In addition, the method for automated feature extraction during integration can
significantly increase the amount of useful input data while reducing the expert's time.
   Using a hybrid approach to preprocessing poorly structured data increases the
efficiency of managerial decisions in healthcare by increasing the reliability of data
mining models and reducing the time experts spend on creating and supporting them.
Further work in this area will be directed to the development of methods for the
semi-automatic selection of features for analytical models.


                          Fig. 5. The interface for features extraction


References
 1. Zakharova, A.A., Lagerev, D.G., Podvesovskii, A.G.: Multi-level Model for Structuring
    Heterogeneous Biomedical Data in the Tasks of Socially Significant Diseases Risk
    Evaluation. In: 3rd Conference on Creativity in Intelligent Technologies and Data
    Science, CIT and DS 2019, pp. 461-473, Volgograd (2019)
 2. Choporov, O.N., Zolotuhin, O.V., Bolgov, S.V.: Algoritmizaciya intellektual'nogo analiza
    dannyh o rasprostranennosti zabolevanij na regional'nom i municipal'nom urovnyah. In:
    Modelirovanie, optimizaciya i informacionnye tekhnologii № 2 (9), (2015)
 3. Lazarenko, V.A., Antonov, A.E.: Diagnostika i prognozirovanie veroyatnosti
    vozniknoveniya holecistita na osnove nejrosetevogo analiza faktorov riska. In:
    Issledovaniya i praktika v medicine. №4(4), pp. 67-72. (2017)
    https://doi.org/10.17709/2409-2231-2017-4-4-7
 4. Dahiwade, D., Patle, G., Meshram, E.: Designing Disease Prediction Model Using Machine
    Learning Approach. In: 2019 3rd International Conference on Computing Methodologies
    and Communication (ICCMC), pp. 1211-1215, Erode, India (2019)
    https://doi.org/10.1109/ICCMC.2019.8819782
 5. Christensen, A., Frandsen, A., Glazier, S., Humpherys, J.: Machine Learning Methods for
    Disease Prediction with Claims Data. In: 2018 IEEE International Conference on Healthcare
    Informatics (ICHI), pp. 467-474, New York, NY (2018)
    https://doi.org/10.1109/ICHI.2018.00108
 6. Shukla, N., Hagenbuchner, M., Win, T.K.: Breast cancer data analysis for survivability
    studies and prediction. In: Computer Methods and Programs in Biomedicine (2017)
    https://doi.org/10.1016/j.cmpb.2017.12.011
 7. Lohr, S.: For Big-Data Scientists, 'Janitor Work' is Key Hurdle to Insights,
    http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
    Last accessed 14 July 2020
 8. Makarova, E., Lagerev, D., Lozbinev, F.: Approaches to visualizing big text data at the stage
    of collection and pre-processing. In: Scientific Visualization N. 11.4, pp. 13–26, (2019).
    https://doi.org/10.26583/sv.11.4.02
 9. Karpov, O.E., Gavrishev, M.Yu., Shishkanov, D.V.: Integraciya medicinskoj
    informacionnoj sistemy i sistemy administrativno-hozyajstvennoj deyatel'nosti kak
    instrument optimizacii processov medicinskoj organizacii. Otdel'nye problemy i puti
    ih resheniya. In: Sovremennye naukoemkie tekhnologii. № 9-1. pp. 46-50. (2016)
10. Portal of operational interaction of USISH participants,
    http://portal.egisz.rosminzdrav.ru/materials Last accessed 14 July 2020
11. Kreuzthaler, M., Oleynik, M., Avian, A., Schulz, S.: Unsupervised Abbreviation Detection
    in Clinical Narratives. In: Studies in Health Technology and Informatics. v. 245,
    pp. 539–543 (2016)
12. Lagerev, D., Makarova, E.: Features of preliminary processing of semi-structured medical
    data in Russian for use in ensembles of data mining models. 2020. Vol. 17, No. 7.
    pp. 43–53. https://doi.org/10.14489/vkit.2020.07.pp.043-053
13. Harris, Z.S.: Distributional Structure. Word, v. 10, pp. 146-162 (1954)
    https://doi.org/10.1080/00437956.1954.11659520
14. Tanimoto, T.T.: IBM Internal Report 17th Nov. IBM. Corp, New York (1957).