Development of Methods for Extracting Information from Pharmacy Line Using Conditional Random Fields

Alexey I. Molodchenkov1,2,3[0000-0003-0039-943X], Artem A. Nikolaev1, and Evgenia A. Mitrokhina2

1 Federal Research Center "Informatics and Control" of the Russian Academy of Sciences, Moscow
2 Moscow Institute of Physics and Technology, Dolgoprudny
3 Peoples' Friendship University of Russia, Moscow
mitrohina.ea@phystech.edu, aim@tesyan.ru

Abstract. The paper considers the problem of extracting information from short pharmacy lines in Russian: lines from which the full name of the drug, the manufacturer, the form of issue, the dosage, the number of pieces in a package and several other parameters must be extracted. A conditional random field (CRF) algorithm was used to extract this information. A method for preliminary standardization of the strings was also developed to bring string tokens to a single form. More than seven thousand pharmacy lines were labeled for the experiments, and two CRF models were trained, with and without preliminary standardization of the lines. The model with standardization achieves an accuracy of 0.95 on the validation set and 0.89 on the test set; the model without standardization achieves 0.95 and 0.87, respectively.

Keywords: Named Entity Recognition · Conditional Random Fields

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Extracting information from texts is a relevant task, since it underlies the solution of a number of practical problems. Its main goal is to convert unstructured text data into some structured form (for example, a table or a semantic graph) for further processing.

Text analysis mainly consists of the following steps:

– vectorization of the text;
– application of various methods (for example, machine learning) for further processing, depending on the problem being solved.

Text vectorization consists of bringing words to their normal form and then converting them to vector form. For this, tokenization, morphological analysis and vectorization methods are used. To bring words to their normal form, the libraries Mystem [1] (for Russian), pymorphy2 [2] and nltk [3] (for Russian and some other languages) can be used. To vectorize words and texts, methods and pre-trained models such as bag of words [4], word2vec [5], doc2vec [6] and others can be applied.
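As a small illustration of the normalization step (not taken from the paper), a few lines of Python with pymorphy2 are enough to bring Russian tokens to their dictionary form before vectorization; the token list below is a made-up example.

```python
# Minimal sketch: lemmatization of Russian tokens with pymorphy2.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def to_normal_form(tokens):
    # take the most probable morphological parse of each token and return its lemma
    return [morph.parse(token)[0].normal_form for token in tokens]

print(to_normal_form(["аскорбиновая", "кислота"]))
```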
At the next stage, depending on the task, regular expressions, rules, additional dictionaries, machine learning methods, etc. are applied.

In this paper, we consider the problem of extracting information from short pharmacy lines that describe the goods sold, with the aim of matching them against a predefined reference book of medicinal products. Such solutions can be applied in various fields and companies. For example, a marketing agency can use this information to assess the pharmaceutical market, while large companies with many warehouses and stores can use it to automate the accounting of their products.

A feature of the texts used in this work is their small length and the high density of entities that need to be recognized. For example, a pharmacy line contains information about the name of the drug, the manufacturer, the batch number, the taste (if available), the form of issue, the dosage, etc. The texts also contain many previously unseen words (for example, new names of drugs or manufacturers), a minimum of grammar and many abbreviations. These features severely limit the applicability of the most commonly used approaches and algorithms.

2 Problem Statement

The task of extracting information from texts is a Named Entity Recognition (NER) task. A named entity is an n-gram in the text for which a class is defined. The task of recognizing named entities is to select continuous fragments of text and classify them.

The input is a pharmacy line in Russian of approximately the following form: "АСКОРБИНОВАЯ К-ТА ГЛЕНВИТОЛ КЛУБНИКА №10 ТАБ.ЖЕВ. КРУТКА". It is necessary first to recognize the name of the drug, the manufacturer, the batch number and other parameters in this line, and then to link them to the reference drug name, manufacturer, batch number, etc., so that the line can be found in the reference book.

Let us list the problems that complicate the solution of this task:

– Abbreviations of some words ("к-та" instead of "кислота", "acid").
– Manufacturers recorded in different languages ("биодерма лаборатория" and "BIODERMA LABORATORIES").
– Words whose meaning depends on the context (the word «мед» as a taste, "honey", or as an abbreviation of the word «медицинский», "medical").

3 An Overview of Named Entity Recognition Methods

Initially, the NER problem was solved without machine learning at all, using rule-based systems (for example, regular expressions). Such a solution stops working as soon as any ambiguities of natural language come into play, but even in our task it can be used to determine the batch number, since only a limited number of ways of recording it can be found in the data. This solution gives an f1-score of about 0.96 on one dataset and 0.93 on the other.
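As a purely illustrative sketch of this rule-based idea (the paper does not list the actual patterns it uses), a field whose notation follows a handful of fixed forms can be picked out with a single regular expression; here the "№10"-style tokens from the example line above stand in for such a field.

```python
import re

# Hypothetical pattern for tokens like "№10" or "№ 100"; the real set of
# patterns used in the paper is not specified there.
NUMBER_SIGN = re.compile(r"№\s*(\d+)")

line = "АСКОРБИНОВАЯ К-ТА ГЛЕНВИТОЛ КЛУБНИКА №10 ТАБ.ЖЕВ. КРУТКА"
match = NUMBER_SIGN.search(line)
if match:
    print(match.group(1))  # -> 10
```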
In [7], the authors investigated several different ways to recognize names, dates, locations, phone numbers and times in short messages in Swedish, including regular expressions. This method shows the best result for dates (an F-measure of 0.72) and the worst for locations (0.57). The paper also shows that dictionaries and part-of-speech features significantly improve this result (the average F-measure increased from 0.65 to 0.84).

The next step in solving the NER problem was classical supervised machine learning. In addition, entity dictionaries were actively used; they did not solve the ambiguity problem, but they did improve the quality. Among the algorithms actively used at the time were Support Vector Machines (SVM, [8]) and Conditional Random Fields (CRF, [9]), as well as decision trees [10], hidden Markov models [11] and others. The disadvantage of these models is that feature selection is a completely empirical process, based primarily on linguistic intuition and then on trial and error; moreover, the choice of features depends on the problem, which implies additional research for each new NLP task. A more detailed overview of methods for recognizing named entities can be found in [12].

As for modern algorithms, the problem of recognizing named entities is usually solved by neural networks built around Bi-LSTM + CRF (long short-term memory + conditional random fields [13]). Pre-trained embeddings are fed to the Bi-LSTM input, and after several Bi-LSTM layers the output layer is a conditional random field (an undirected graphical model without which it is usually impossible to reach state-of-the-art results). Capitalization features, parts of speech, morphological features, etc. can also be added to the input embeddings (Bi-LSTM + CRF + Char + Capitalization + POS).

In [14], the authors tested several variants of neural network architectures containing character and word Bi-LSTM, CRF, word embeddings, highway networks, etc. on three Russian-language datasets (Gareev's dataset, FactRuEval 2016, Persons-1000), and it was the Bi-LSTM + CRF + external word embeddings model that showed state-of-the-art results (F-measures of 87.17, 99.26 and 82.10, respectively).

It should be mentioned separately that short texts differ significantly from long ones, and standard methods for recognizing named entities work poorly on them. This is exactly what is shown in [15]: the quality dropped from the usual 0.8-0.9 to 0.3-0.5 on tweets. [16] presents the results of applying various existing systems to the task of recognizing named entities in tweets. Some Twitter-specific methods achieve F1 scores over 0.8, but they are still far from the results achieved on longer news texts. The authors argue that the main reason for the deterioration is poor capitalization, a feature that is very important for named entity recognition. Abbreviations and slang (words missing from the dictionary) also degrade the quality, but their influence is less significant.

4 Training a CRF Model to Extract Entities from Pharmacy Strings

The training was carried out on 6,000 labeled lines combined into a table. Each row of the table contains the pharmacy line itself as well as all the parameters that need to be extracted from it; a fragment of the data is shown in Table 1. The output is a trained CRF model that, for any ordered sequence of tokens, predicts the corresponding ordered sequence of classes.

Table 1. Initial data format

Pharmacy line | Drug name | Form of issue | Manufacturer | Dosage | Volume | Number of pieces in a package
САГЕНИТ ТАБ 100МГ Х 30 | САГЕНИТ | ТАБ | НИЖФАРМ - РОССИЯ | 100МГ | NaN | Х 30
ЭХИНАЦЕЯ 1,5Г №20 | ЭХИНАЦЕЯ | NaN | ХОРСТ КОМПАНИЯ (АЛТАЙ) | NaN | 1,5Г | №20
КОМПЛИВИТ КАЛЬЦИЙ Д-3 ФОРТЕ ТАБЛ ЖЕВ №100 МЯТНЫЕ | КОМПЛИВИТ КАЛЬЦИЙ Д-3 ФОРТЕ | ТАБЛ ЖЕВ МЯТНЫЕ | ФАРМСТАНДАРТ-УФИМСКИЙ ВИТАМИННЫЙ 3-Д ОАО | NaN | NaN | №100
ГРУДНОЙ СБОР №1 50Г | ГРУДНОЙ СБОР №1 | NaN | ЛЕК С+ | NaN | 50Г | NaN
ТЕТРАЦИКЛИН ТАБ. П/ПЛЕН. ОБ. 100МГ №20(БЛИСТЕР) | ТЕТРАЦИКЛИН | ТАБ. П/ПЛЕН. ОБ. | БИОСИНТЕЗ | 100МГ | NaN | №20

The learning algorithm consists of the following steps:

– string standardization;
– converting the strings to the format required for CRF;
– extracting features from words;
– training the CRF model to predict the class of a word from the extracted features.

Let us consider these steps in more detail.

4.1 String Standardization

By standardizing a string we mean bringing its tokens to a single form. The standardization method performs the following conversions:

– Removes extra characters (quotes, brackets, commas).
– Brings Cyrillic tokens to a single form using a dictionary of substitutions. At this step the most frequent spelling errors are "corrected", endings are brought to a pre-selected form, and abbreviations are replaced with full words.
– In fractions, replaces a comma with a dot.
– Removes extra spaces and adds spaces where needed.

An example string before standardization is 'ВАКСИГРИП СУСП.В/М И П/К 0,5МЛ/ДОЗА ШПР. №1' and after it 'ВАКСИГРИП СУСПЕНЗИИ ВНУТРИМЫШЕЧНОГО ВВЕДЕНИЯ И ПОДКОЖНОГО 0.5 МЛ ДОЗА ШПР №1'.
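A simplified sketch of such a standardization step is shown below; it assumes a tiny illustrative substitution dictionary and does not handle glued abbreviations (such as "СУСП.В/М") or ending normalization, which the actual method also covers.

```python
import re

# Illustrative fragment of a substitution dictionary; the real one is much larger.
SUBSTITUTIONS = {
    "К-ТА": "КИСЛОТА",
    "ШПР.": "ШПР",
    "ТАБ.ЖЕВ.": "ТАБЛЕТКИ ЖЕВАТЕЛЬНЫЕ",
}

def standardize(line: str) -> str:
    line = re.sub(r"(\d),(\d)", r"\1.\2", line)        # decimal comma -> dot
    line = re.sub(r'["\'()\[\],]', " ", line)          # drop quotes, brackets, commas
    line = re.sub(r"(\d)(?=[А-ЯЁ])", r"\1 ", line)     # add a space between a number and a unit
    tokens = [SUBSTITUTIONS.get(token, token) for token in line.split()]
    return " ".join(tokens)

print(standardize("АСКОРБИНОВАЯ К-ТА ГЛЕНВИТОЛ КЛУБНИКА №10 ТАБ.ЖЕВ. КРУТКА"))
```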
The application of standardization in this task has two goals:

– It improves the accuracy of the model and allows it to learn well on a small sample (or, equivalently, to reach similar quality with a smaller sample when comparing the approaches with and without standardization).
– Since we isolate and classify tokens in order to then search for the closest drug or product in a reference book of all possible options, the second goal of standardization is to be able to compare tokens by simple equality rather than by proximity metrics. This makes it possible to filter by those fields that are standardized unambiguously in our case.

4.2 Converting a String to the Form Required to Use CRF

Initially, the data is a table of almost 6,000 labeled rows. Each row of the table contains the pharmacy line itself as well as all the parameters that need to be extracted from it (see Table 1). To train the CRF model, the data must be presented as a table in which each row contains one token, the number of the pharmacy line from which this token was taken, and the class corresponding to this token (see Fig. 1).

The possible classes are:

– FORM_QN: number of pieces in a package;
– FULL_NAME: full name of the drug;
– MV: volume;
– NM_D: dosage;
– NM_F: form of issue;
– PROD: manufacturer;
– O: does not belong to any of the above classes.

Not all of the listed parameters are required to appear in every line.

Fig. 1. Data in the format required to use the CRF

4.3 Features that Were Used to Train the Model

The following features of a word were used: the word itself in lower case, its last two characters, its length, and a flag indicating whether the token is a number. The same features were also computed for the two neighboring tokens.

4.4 Teaching the CRF Model to Predict the Class for a Word

To train the model and conduct the experiments, the data set was divided into training and validation samples (the validation sample is 20% of the whole set). CRF (Conditional Random Fields) was chosen as the classification method because it allows the feature set used to vectorize words and texts to be formed independently, and because it is popular for NER, being designed for sequence labeling. Word embeddings and other standard vectorization methods are not suitable for this task: new drugs keep appearing, the vocabulary is highly specific, and the existing methods and pre-trained models were trained on general-domain vocabulary.
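A compact sketch of this setup is given below. It assumes the sklearn-crfsuite package (the paper does not say which CRF implementation was used), and the two labeled lines are toy data in the spirit of Table 1, so everything in the snippet is illustrative rather than the authors' actual code.

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Features from Section 4.3: lower-cased word, last two characters,
    length, is-a-number flag, plus the same features for the neighbors."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.suffix2": word[-2:],
        "word.len": len(word),
        "word.isdigit": word.isdigit(),
    }
    for shift, prefix in ((-1, "prev"), (1, "next")):
        j = i + shift
        if 0 <= j < len(sent):
            neigh = sent[j]
            feats.update({
                prefix + ".lower": neigh.lower(),
                prefix + ".suffix2": neigh[-2:],
                prefix + ".len": len(neigh),
                prefix + ".isdigit": neigh.isdigit(),
            })
        else:
            feats[prefix + ".none"] = True
    return feats

def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

# Toy training data in the format of Fig. 1 (token sequences and class sequences).
X_toy = [["САГЕНИТ", "ТАБ", "100МГ", "Х", "30"],
         ["ГРУДНОЙ", "СБОР", "№1", "50Г"]]
y_toy = [["FULL_NAME", "NM_F", "NM_D", "FORM_QN", "FORM_QN"],
         ["FULL_NAME", "FULL_NAME", "FULL_NAME", "MV"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([sent_features(s) for s in X_toy], y_toy)
print(crf.predict([sent_features(s) for s in X_toy]))
```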
A random field is a multidimensional random variable $V$ whose components are one-dimensional random variables. For convenience we assume that all $V_i$ are discrete and that their sets of values are finite. We denote a realization of the multidimensional random variable $V$ by $v \in \Omega$, where $\Omega$ is the set of all possible configurations. A random field can be represented as a graph whose vertices are the components of $V$ and whose edges are the dependencies between them. A random field is called Markov if two Markov conditions are satisfied:

1. $\forall v \in \Omega: P(V = v) > 0$;
2. $P(V_i = v_i \mid V_j = v_j,\ j \in A \setminus \{i\}) = P(V_i = v_i \mid V_j = v_j,\ j \in \delta_i)$, where $A$ is the set of all vertex indices and $\delta_i$ is the set of neighbors of the vertex $V_i$.

A conditional random field is a Markov random field in which the set of random variables is divided into two disjoint subsets, $X$ and $Y$: the observable and the hidden variables. The prediction task is to reconstruct the values of $y$ optimally, given the observables $x$; that is, the optimization task is to maximize the conditional probability $p(y|x)$: $y^* = \arg\max_y p(y|x)$. Estimating the model $p^*(y|x)$ is posed as an optimization problem with constraints (the difference between the observations and their estimates must be minimal, and $\sum_y p(y|x) = 1$ for all $x$). According to the Hammersley-Clifford theorem (which connects Markov random fields and the Gibbs distribution), we need to maximize

$$p(y|x) = \frac{\prod_{c \in C(G)} \psi_c(x, y)}{\sum_{y'} \prod_{c \in C(G)} \psi_c(x, y')},$$

where $C(G)$ is the set of cliques of the graph $G$ and the factor functions $\psi_c$ are usually the exponent of a linear combination of feature functions with weights determined during training: $\psi_c = \exp\left(\sum_{k=1}^{K} f_k(x_c, y_c)\, \theta_k\right)$. This method belongs to the probabilistic methods of classical machine learning. Its implementations are fast, which is very important when processing large amounts of information. More details about the CRF method can be found in [9].

5 Experimental Research

Two samples were used for the experiments. The first sample contains 6,000 pharmacy lines and is randomly divided into training and validation parts at a ratio of 80%/20%. The second sample consists of an additional 1,000 lines taken from another dataset; it contains a significant proportion of drugs unknown to the model, since they were absent from the training sample. This sample was used as the test set.

The two resulting models (with and without string standardization) were evaluated on the validation and test datasets; the results of the experiments are shown in Tables 2-5. As a baseline, tokens were vectorized by n-grams and the vectors were compared using cosine distance; the resulting average accuracy, used for comparison, was 0.65.
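As a rough illustration of what such a baseline can look like (the exact n-gram size and matching procedure are not specified in the paper, and the reference entries below are made up), character n-gram vectors and cosine similarity can be used to match a raw line against reference entries:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative reference entries and query line.
reference = ["АСКОРБИНОВАЯ КИСЛОТА", "ТЕТРАЦИКЛИН", "КОМПЛИВИТ КАЛЬЦИЙ Д-3 ФОРТЕ"]
query = "АСКОРБИНОВАЯ К-ТА ГЛЕНВИТОЛ КЛУБНИКА №10"

# Character n-gram vectorization of the reference entries and the query.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
reference_vectors = vectorizer.fit_transform(reference)
query_vector = vectorizer.transform([query])

# Pick the reference entry closest to the query by cosine similarity.
similarities = cosine_similarity(query_vector, reference_vectors)[0]
best = similarities.argmax()
print(reference[best], round(float(similarities[best]), 3))
```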
The first thing to notice is that both models outperform the baseline. The models show their worst results on the test data (especially for MV and NM_D). This can be explained by the fact that the test set contains a large number of drugs that are completely new to the model and differs somewhat from the data on which training and validation were carried out; for example, dosages and volumes without units of measurement are more common in the test set.

One can also notice that string standardization does not improve prediction quality on the validation set, whereas on the test set there are noticeable improvements for volume, dosage and form of issue, the classes on which the standardization method has the most significant influence. The difference from the validation set can be explained by the fact that the test data contain more typos and abbreviations that need to be corrected by standardization, so its effect is more noticeable.

In all experiments the model predicts the full name of the drug (FULL_NAME) best of all; the worst-predicted classes are dosage (NM_D) and volume (MV). The difficulties with dosage and volume may arise because they are very similar and easy to confuse.

Table 2. No preprocessing of lines, validation set

Class | Precision | Recall | F1 | Support
FORM_QN | 0.97 | 0.98 | 0.98 | 545
FULL_NAME | 0.95 | 0.97 | 0.96 | 4112
MV | 0.97 | 0.98 | 0.97 | 832
NM_D | 0.91 | 0.89 | 0.90 | 469
NM_F | 0.95 | 0.93 | 0.94 | 1047
PROD | 0.94 | 0.97 | 0.95 | 2283
O | 0.95 | 0.88 | 0.91 | 2280
accuracy | | | 0.95 | 11568
macro avg | 0.95 | 0.94 | 0.94 | 11568
weighted avg | 0.95 | 0.95 | 0.95 | 11568

Table 3. No preprocessing of lines, test set

Class | Precision | Recall | F1 | Support
FORM_QN | 0.97 | 0.97 | 0.97 | 981
FULL_NAME | 0.86 | 0.89 | 0.87 | 2042
MV | 0.58 | 0.90 | 0.71 | 140
NM_D | 0.70 | 0.48 | 0.57 | 152
NM_F | 0.86 | 0.83 | 0.84 | 1228
PROD | 0.87 | 0.93 | 0.90 | 1958
O | 0.86 | 0.77 | 0.82 | 1852
accuracy | | | 0.87 | 8353
macro avg | 0.82 | 0.82 | 0.81 | 8353
weighted avg | 0.87 | 0.87 | 0.87 | 8353

Table 4. With line preprocessing, validation set

Class | Precision | Recall | F1 | Support
FORM_QN | 0.96 | 0.98 | 0.97 | 534
FULL_NAME | 0.95 | 0.97 | 0.96 | 4445
MV | 0.94 | 0.98 | 0.96 | 1419
NM_D | 0.95 | 0.90 | 0.92 | 939
NM_F | 0.97 | 0.95 | 0.96 | 1377
PROD | 0.95 | 0.98 | 0.96 | 2603
O | 0.95 | 0.85 | 0.90 | 1555
accuracy | | | 0.95 | 12872
macro avg | 0.95 | 0.94 | 0.95 | 12872
weighted avg | 0.95 | 0.95 | 0.95 | 12872

Table 5. With line preprocessing, test set

Class | Precision | Recall | F1 | Support
FORM_QN | 0.97 | 0.95 | 0.96 | 855
FULL_NAME | 0.88 | 0.91 | 0.89 | 2186
MV | 0.66 | 0.83 | 0.74 | 285
NM_D | 0.75 | 0.59 | 0.66 | 272
NM_F | 0.95 | 0.90 | 0.92 | 1811
PROD | 0.88 | 0.95 | 0.91 | 2218
O | 0.88 | 0.78 | 0.82 | 1299
accuracy | | | 0.89 | 8926
macro avg | 0.85 | 0.84 | 0.84 | 8926
weighted avg | 0.89 | 0.89 | 0.89 | 8926

6 Conclusion

Using the CRF method, it was possible to obtain a model that shows good results in recognizing named entities in short texts on pharmacological topics. Accuracy on the validation data is 0.95 and on the test data 0.89. The deterioration of the results can be explained by the appearance of new drugs that are absent from the training sample, as well as by some differences in the data structure, for example the frequent absence of units of measurement for volumes and dosages. In the future, it is planned to improve the quality by combining different approaches in the word classification model and by expanding the set of features used to vectorize tokens.

References

1. Mystem. https://yandex.ru/dev/mystem/.
2. Pymorphy2. https://pymorphy2.readthedocs.io/en/stable/.
3. Natural Language Toolkit. https://www.nltk.org/.
4. Zellig Harris. Distributional structure. Word, 10:146–162, 1954.
5. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop Papers, 2013.
6. Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.
7. Tobias Ek, Camilla Kirkegaard, Håkan Jonsson, and Pierre Nugues. Named entity recognition for short text messages. Procedia - Social and Behavioral Sciences, 27:178–187, 2011. Computational Linguistics and Related Fields.
8. William S. Noble. What is a support vector machine? Nature Biotechnology, 2006.
9. Bengong Yu and Zhaodi Fan. A comprehensive review of conditional random fields: variants, hybrids and applications. Artificial Intelligence Review, 2020.
10. S. B. Kotsiantis. Decision trees: a recent overview. Artificial Intelligence Review, 2013.
11. L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
12. David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification, 2007.
13. Changki Lee. LSTM-CRF models for named entity recognition. IEICE Transactions on Information and Systems, E100.D(4):882–887, 2017.
14. The Anh Le, Mikhail Arkhipov, and Mikhail Burtsev.
Application of a hybrid Bi-LSTM-CRF model to the task of Russian named entity recognition, pages 91–103, 2018.
15. Alan Ritter, Sam Clark, Oren Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, 2011.
16. Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing and Management, 51(2):32–49, 2015.