Methods for the Intelligent Analysis of Biomedical Data
                             E.V. Geger1, A.G. Podvesovskii2, S.A. Kuzmin2, V.P. Tolstenok2
                  emiliya_geger@mail.ru|apodv@tu-bryansk.ru|wolv3333@mail.ru|tolstenok21@yandex.ru
                                    1 Bryansk Clinicodiagnostic Center, Bryansk, Russia
                                    2 Bryansk State Technical University, Bryansk, Russia
     The paper discusses a methodology for cleaning and analyzing small semi-structured samples of biomedical data. The methodology
is aimed at statistically evaluating the correlation between harmful production factors and workers’ laboratory test data. Analysis
and interpretation of the data show that some clinical blood test indicators deviate from the norm in individuals
whose occupational activity is associated with harmful factors. It is concluded that further research is needed in the group
of people whose work is related to harmful production factors, and that intelligent methods should be employed to analyze possible health
risks and their negative consequences in order to support management decisions. The presented assessment methodology can be used to
create an occupational health and safety information system.
   Keywords: risk assessment, data analysis, harmful working conditions, statistical methods, data cleaning, model ensembles,
Kohonen self-organizing maps.

1. Introduction

    In the modern world, there is often a need for auxiliary methods of preliminary disease detection
at early stages. This problem is especially characteristic of people whose work is associated with
constant exposure to harmful working conditions [6, 10].
    In turn, an increased level of exposure to harmful substances and related industrial hazards
significantly increases the likelihood of developing occupational diseases and the risk of injury [8].
    Thus, occupational risk management is a complex of organizational and technical measures which
should be based on reliable results of data analysis [9].
    Occupational risk assessment requires medical data analysis methods that are suited to the task.
The choice of methods affects the construction of theoretical biomedical models and the characteristics
of experimental studies [7, 17].
    However, many experts note that biomedical data are often unsuitable for processing with
traditional software, not only because of their volume but also because of the variety of data types
and the speed at which they must be analyzed [1, 3, 15]. Therefore, a system for the intelligent
analysis of medical data is required, one that can aggregate and analyze heterogeneous information
coming from different sources: electronic medical records, monitoring sensors, ultrasound and X-ray
apparatus, and other devices [4].
    Real biomedical data have a number of specific features that make them unstructured: various data
corruptions, such as omissions, extreme values, manual entry errors and incorrect information; high
dimensionality and heterogeneity; and a large amount of noisy and duplicate data. As a result, most of
the data, or even the entire sample, may be unsuitable for existing analysis algorithms [16].
    Analysts believe that, to solve many tasks of the healthcare system today, it is necessary to move
toward structuring information and focusing on work with small samples [5].
    Particular attention should be paid to so-called small samples, the volume of which is about
100-200 records. Working with such samples is difficult, and it makes existing methods of data
processing and analysis ineffective.
    The main obstacle to the manual analysis of small samples is the analyst's inability to notice
hidden patterns in the presented data, whereas special analytical algorithms detect existing patterns
much more efficiently. The main task of the analyst, in this case, is to interpret the results of these
algorithms and filter out false and trivial patterns.
    New analytical technologies can not only increase the efficiency of medical institutions but also
help to solve such healthcare problems as identifying diagnoses and medical errors, finding
associative connections between diagnoses and laboratory test results, and much more.
    The objective of the present research has been to assess the relationship between occupational
morbidity and harmful production factors. The experimental sample therefore consisted of individuals
whose occupational activities were associated with harmful and dangerous working conditions. Another
objective was the development of a new methodology focused on processing and analyzing small
semi-structured samples of biomedical data.

2. Existing Solutions

    Among the many proposed methods and tools for working with small samples, it is advisable to apply
ensemble data analysis methods in combination with a basic classifier, namely Kohonen self-organizing
maps.
    An ensemble of models is a combination of several learning algorithms that, working together, help
to build a model more efficient and accurate than any of the models built using a single algorithm.
That is, to solve one problem or to test a hypothesis, not one but several models are used, and it is
the overall result that matters rather than the result of any single model.
    Formerly, researchers faced the problem of combining accuracy, simplicity and ease of
interpretation in one method. One could use a simple method that was relatively easy to interpret but
fell short on accuracy, or, vice versa, a complex but accurate method that was difficult to interpret.
Ensembles of models have become a solution to this problem as a universal way of improving the
accuracy of methods.
    Ensemble learning refers to the training of a finite set of basic classifiers with the subsequent
combination of their forecasting results into a single forecast of an aggregated classifier. The
aggregated classifier is thus expected to give a more accurate result.
    The goal of combining models is to improve the solution provided by a single model. It is assumed
that a single model cannot achieve the efficiency that the ensemble provides.
    Self-organizing maps are a special type of artificial neural network allowing for non-linear
regression and projection of multidimensional data onto a two-dimensional plane while approximately
preserving the neighborhood relations of the original data space [2]. This approach can be applied in
various fields, including biomedical data processing.
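    As an illustration of how a Kohonen self-organizing map projects multidimensional records onto a
two-dimensional grid, a minimal sketch in pure Python is given below. It is not the implementation used
in this study (which relied on a ready-made analytical platform); the function names, the 3x3 grid
size, the learning-rate and neighborhood schedules and the toy data are all assumptions chosen only for
the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, rows=3, cols=3, epochs=200, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular Kohonen map; returns {(row, col): weight vector}.

    All hyperparameters here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                   # learning rate shrinks linearly
        sigma = max(sigma0 * decay, 0.3)   # neighborhood radius shrinks, with a floor
        for x in rng.sample(data, len(data)):  # visit records in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Squared distance on the 2-D grid, not in the data space.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy data: two well-separated groups of already normalized 2-D records.
group_a = [[0.10, 0.10], [0.15, 0.10], [0.10, 0.20], [0.20, 0.15]]
group_b = [[0.90, 0.90], [0.85, 0.90], [0.90, 0.80], [0.80, 0.85]]
som = train_som(group_a + group_b)
print(best_matching_unit(som, [0.1, 0.1]), best_matching_unit(som, [0.9, 0.9]))
```

    Each record is assigned to the grid cell of its best matching unit, so records from different
groups that land in different cells can be inspected visually. Real blood test scores would first need
to be scaled to comparable ranges.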


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
3. Proposed Theoretical Solution

    To solve the research problems of working environment risk assessment, we formed two groups:
    Group I included individuals whose work was associated with exposure to harmful production
factors.
    Group II consisted of individuals whose professional activity did not involve harmful production
factors.
    The studies were carried out in the laboratory of Bryansk Clinical Diagnostic Center; the results
were recorded in the medical information system "MAIS DC".
    The list of occupational diseases was determined in accordance with the Order of the Ministry of
Health and Social Development of the Russian Federation No. 417n dated April 27, 2012 “On the Approval
of the List of Occupational Diseases”. Those working in harmful labor conditions are at risk for
diseases associated with exposure to occupational physical factors [12].
    The studies were carried out in accordance with the Order of the Ministry of Health of the Russian
Federation dated April 12, 2011 No. 302n (as amended on February 06, 2018) “On the Approval of Lists of
Harmful and (or) Hazardous Occupational Factors and Kinds of Work that Require Mandatory Preliminary
and Periodical Medical Examinations (Surveys), and the Procedure for Conducting Mandatory Preliminary
and Periodical Medical Examinations (Surveys)” [13].
    To assess the possible risk of diseases caused by harmful occupational factors, the values of
clinical blood test scores were used as a source of information. The experiment was carried out in
compliance with the ethical principles of biomedical research and in accordance with the Federal Law
of the Russian Federation No. 152-FZ "On Personal Data" [14].
    For morbidity analysis, the "International Statistical Classification of Diseases and Related
Health Problems" of the tenth revision (ICD-10) was used [11].
    To solve the problem, a methodology was proposed consisting of the following stages:
    1. Data cleaning.
    2. Model ensemble development.
    3. Result interpretation.
    To carry out the first stage, a special model has been designed and developed, the task of which is
to clean the source data and convert unstructured data into ordered data.
    The resulting model can be roughly divided into five parts.
    Data import. It includes setting field names and labels, excluding empty fields, and setting data
types.
    Data preprocessing. This submodel outputs two data sets: intact data, containing no errors, and
corrupted data, which are passed to the next submodel.
    Data cleaning. This submodel receives the damaged data as input and, after appropriate processing,
outputs cleaned data that do not contain the initial errors. In particular, this submodel sets correct
deviation values. In addition to the cleaned data, the output of the submodel contains data sets with
incorrect values in deviations and in scores.
    Data merge. In this part, the initially complete data and the cleaned data are aggregated. In
addition, using code written in the JavaScript programming language, a unique patient identifier is
generated and records with several diagnoses are split, so that one record containing three diagnoses
yields three records containing one diagnosis each. This, in turn, allows for data structuring. The
converted data can be used when applying a transaction analysis algorithm. Because of errors in the
diagnosis names, it was decided to download an Excel file containing the codes and names of 12257
diagnoses of the International Classification of Diseases of the tenth revision [11]. In the resulting
data set, the initial field “Diagnosis Name” has been replaced by the name of the diagnosis from this
file.
    Data export. It creates a file with the extension .txt, suitable for uploading to analytical
platforms and further analysis.
    The features of this model, in addition to the fact that it prevents needless data loss, are:
    1. Correction of empty values.
    2. Deletion of entries containing no information.
    3. Recalculation of deviations by scores.
    4. Recovery of missing scores by deviations.
    5. Reassignment of diagnoses by ICD codes.
    6. Assigning a unique number to each patient.
    7. Increasing the number of records suitable for analysis without generating synthetic patients.
    8. Download of, and comparison with, the certified ICD directory of codes and their descriptions.
    9. Uploading data that were damaged and subsequently corrected, for possible additional analysis.
    10. Converting source data into a form suitable for intelligent analysis methods.
    At the second stage, a model ensemble is developed and constructed. To simplify the development
process, the following algorithm is followed:
    1. Select a base model.
    An ensemble can consist of classifiers of one type (e.g., only decision trees or only neural
networks) or of classifiers of various types (decision trees, neural networks, regression models,
etc.).
    2. Define the approach to using the learning set.
    This can be resampling (several subsamples are extracted from the original learning set, each of
which is used to train one of the ensemble models) or the use of one learning set to train all ensemble
classifiers.
    3. Select a method for combining results.
    Three methods are usually used: voting (the class produced by a simple majority of the ensemble
models is selected), weighted voting (the result takes into account weights set for the ensemble
models) and averaging (the output of the entire ensemble is defined as the simple average of the
outputs of all models; in weighted averaging, the outputs of all models are multiplied by the
corresponding weights).
    As a result of applying this algorithm, a basic ensemble of models has been obtained (Fig. 1),
which can be further modified and made more complex.

    Fig. 1. Structure of the developed model ensemble

    The data that have undergone preliminary cleaning at the first stage serve as the input data.
    At the third stage, the results obtained by applying the ensemble are interpreted. In most cases,
their qualitative interpretation requires consulting specialists in the field of medicine.
    For automatic analysis and visualization, the ready-made analytical platform Deductor Studio was
used; for preprocessing and data cleaning, Loginom software products developed by BaseGroup Labs
[18, 19] were used.
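    As an illustration of step 3 of the ensemble algorithm described in this section, the three
combination methods (voting, weighted voting and averaging) can be sketched in Python as follows. This
is a minimal sketch rather than the implementation used in the study; the function names, labels and
weights are hypothetical, and dividing the weighted sum by the total weight in weighted averaging is an
assumption made so that the result stays on the scale of the individual outputs.

```python
from collections import Counter


def majority_vote(labels):
    """Voting: return the class predicted by the largest number of models.

    Ties are resolved in favor of the label that appears first (an assumption).
    """
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Weighted voting: each model adds its weight to the class it predicts."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def average(outputs, weights=None):
    """Averaging: simple mean of numeric outputs, or weighted mean if weights are given."""
    if weights is None:
        return sum(outputs) / len(outputs)
    # Assumption: normalize by the total weight.
    return sum(o * w for o, w in zip(outputs, weights)) / sum(weights)


# Hypothetical predictions of three ensemble members for one record.
print(majority_vote(["risk", "norm", "risk"]))
print(weighted_vote(["risk", "norm", "norm"], [0.6, 0.2, 0.1]))
print(average([0.2, 0.4, 0.9]))
print(average([0.2, 0.4], weights=[1, 3]))
```

    Voting and weighted voting apply to class labels, whereas averaging applies to numeric outputs such
as class probabilities; which of the three is appropriate depends on what the base models produce.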
    Thanks to the preprocessing of the initial sample, errors related to the human factor can be
either eliminated completely or minimized, which makes it possible to speak of sufficient correctness
of the medical research. Unfortunately, small semi-structured samples cannot be called sufficiently
representative. However, the identified patterns may be of interest to experts or may serve as a basis
for further verification.

4. Proposed Practical Solution

    Laboratory test results have been analyzed for workers whose occupation is related to harmful
factors (Group I) and for a control group (Group II).
    Two small samples have been formed: the first includes 100 records and the second 207 records for
each of the scores considered in each group. A total of 9 scores of the general blood test were
selected.
    Data on the primary disease incidence in workers from Groups I and II have been analyzed. The
results of diagnoses processing are presented.
    The results of the study made it possible to identify the diagnoses that occurred most often in
Group I and Group II:
    E78.0 – Pure hypercholesterolemia;
    G90 – Disorder of autonomic nervous system, unspecified;
    H52.4 – Presbyopia;
    I10 – Essential [primary] hypertension.
    The cross-table, an interactive tool for data representation and analytical processing, was used
to create a pivot table showing the number of cases of each disease in each group.
    The use of Kohonen self-organizing maps helped to identify differences and patterns between the
studied groups.
    Examples of the Kohonen self-organizing maps used are presented in Fig. 2.

    Fig. 2. An example of data processing results using Kohonen self-organizing maps

    It should be noted that in both groups at least 24% of employees were people aged 54 and over. In
this regard, it has been suggested that these diseases may be associated primarily with age-related
changes, for example, an age-related decrease in the accommodative ability of the eye associated with
the natural aging of the lens, a prolonged and persistent increase in blood pressure, and an increase
in blood cholesterol.
    This age group is also characterized by significant positive deviations from the norm in terms of
hemoglobin, red blood cells, and erythrocyte sedimentation rate.
    The increase in red blood cells is especially noticeable in the group of individuals whose work is
associated with harmful occupational factors. Erythrocytosis could be caused by external factors.
    The survey results may indicate the presence of pathological processes in the body. Additional
studies are needed to diagnose the disease that has caused the high content of red blood cells.
    The methodology for analyzing small semi-structured samples of biomedical data considered in the
article can be used with adequate efficiency as an integral part of the risk analysis of diseases
related to harmful occupational factors. Application of the proposed method is demonstrated on actual
data, which were collected and consolidated using an automated medical information system. This
supports a reliable evaluation of the effects of harmful occupational factors on the health indicators
of the working population.
    The research results will help to define and evaluate occupational risk factors that increase the
likelihood of disease development and to draft proposals for preventing the harmful effects of
occupational factors on the health of working citizens.

5. Conclusion

    The article discusses a methodology, tested on actual data, which allows pre-processing, cleaning
and analysis of small biomedical data samples.
    The analysis revealed no statistically significant difference in blood indices between individuals
from Groups I and II.
    Likewise, the study of the diagnoses of the primary disease incidence in both groups has not
revealed statistically significant differences between the groups.
    However, the established significant deviations from the normal content of red blood cells and
eosinophils in the blood of individuals from Group I may indicate the presence of certain pathological
processes in the body, in particular autoimmune processes, the identification of which requires
additional research.
    In future work, it is advisable to study these clinical blood test results using the traditional
analysis of variance method, regardless of normal intervals. This will make it possible to compare the
results obtained, after which final conclusions can be drawn about the influence of specific
occupational factors on the development of pathological processes.
    The results obtained help to increase the efficiency of detecting patterns in biomedical databases
and are of interest for the construction of intelligent systems designed to analyze and assess human
health.

6. Acknowledgements

    The reported study was funded by RFBR, project number 19-07-00844.

7. References

     [1] Bruce McCormick (2014) Update in Anaesthesia. World Federation of Societies of
Anaesthesiologists. 466 p.
     [2] Kohonen T. The Self-Organizing Map // Proceedings of the IEEE. 1990. Vol. 78. P. 1464-1480.
     [3] Manyika J., Chui M., Brown B. et al. (2011) Big Data: The Next Frontier for Innovation,
Competition, and Productivity / McKinsey Global Institute.
     [4] Baranov A.A., Namazova-Baranova L.S., Smirnova I.V. et al. Methods and Tools for Complex
Intelligent Analysis of Medical Data. Trudy ISA RAN. Vol. 65. 2.2015. pp. 81-93.
     [5] Barriers and Prospects for Digital Transformation: Big
Data Management Issues in the Healthcare Industry [Online]. –
Available: http://www.medlinks.ru/article.php?sid=83028
(Accessed: July 23, 2019).
     [6] Geger E.V., Fedorenko S.I. Information Support of
Decision-Making when Assessing the Risk for Occupational
Morbidity Based on Analysis of Binary Samples // Proceedings
of Southwest State University. Control, Computer Engineering,
Information Science. Medical Instruments Engineering, no. 2
(27). 2018. pp. 101-107.
     [7] Healthcare will Show the Largest Increase in Data
Generation by 2025 [Online]. – Available:
http://apcmed.ru/news/news-all/zdravookhranenie-pokazhet-naibolshiy-rost-v-generatsii-dannykh-k-2025-godu/
(Accessed: July 23, 2019).
     [8] Izmerov N.F. Actualization of Occupational
Morbidity Issues // Health Care of the Russian Federation
(Zdravookhraneniye Rossiyskoy Federatsii), no. 2. 2013. pp. 14-17.
     [9] Ismailova L.N. Effective Management of Production
Risks. // Economy and Business: Theory and Practice. 2016. No.
5. pp. 77-79.
     [10] Kostenko N.A. Working Conditions and Occupational
Morbidity as a Basis for Risk Management of Workers' Health:
abstract of cand. med. sci. diss. M., 2015. 21 p.
     [11] International Classification of Diseases of the tenth
revision (ICD-10) [Online]. – Available: https://mkb-10.com
(Accessed: July 20, 2019).
     [12] The Order of the Ministry of Health and Social
Development of the Russian Federation No. 417n dated
27/04/2012 “On the Approval of the List of Occupational
Diseases”. [Online]. – Available:
http://base.garant.ru/70177874/ (Accessed: May 23, 2019).
     [13] The Order of the Ministry of Health of the Russian
Federation dated 12/04/2011 No. 302n (as amended on
06/02/2018) “On the Approval of Lists of Harmful and (or)
Hazardous Occupational Factors and Kinds of Work that Require
Mandatory Preliminary and Periodical Medical Examinations
(Surveys), and the Procedure for Conducting Mandatory
Preliminary and Periodical Medical Examinations (Surveys)”.
[Online]. – Available: http://base.garant.ru/12191202/
(Accessed: May 21, 2019).
     [14] The Federal Law dated 27/07/2006 No. 152-FZ (as
amended on 29/07/2017) "On Personal Data". [Online]. –
Available: http://base.garant.ru/5635295/ (Accessed: May 21,
2019).
     [15] Tsvetkova L.A., Cherchenko O.V. Big Data
Technology in Medicine and Healthcare in Russia and the World
// Information technologies for the Physician, 2016. No. 3. pp.
60-73.
     [16] Tsygankova I.A. Method of Intelligent Processing of
Biomedical Data // Software Products and Systems. 2009.
No. 3. pp. 120-123.
     [17] Choporov O.N., Razinkin K.A. Optimization Model of
Choice of the Initial Plan of Control Actions for Medical
Information Systems / Control Systems and Information
Technology. 2011. Vol. 46. No. 4.1. P. 185-187.
     [18] BaseGroup. Data Analysis Technologies [Online] //
Available: https://basegroup.ru/ (Accessed: May 20, 2019).
     [19] Loginom [Online] // Available: https://loginom.ru/
(Accessed: May 20, 2019).