Automatic Detection of Contraindications of Medicines in Package Leaflet

Jonas Žalinkevičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
jonas.zalinkevicius@hotmail.com

Rita Butkienė
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
rita.butkiene@ktu.lt

Abstract: Before physicians prescribe medicines, they must take into consideration the patient's diseases and the medicines the patient already uses. This is done to avoid complications that may occur. All information about possible contraindications is written in the medicine package leaflet. A system that automatically detects contraindication mentions in the Lithuanian text of a leaflet by applying natural language parsing is presented. This system makes it possible to shorten the time needed for making a prescription decision. The results of the experiment showed that the created system successfully detected 56 per cent of contraindications.

Keywords: medicine contraindications, drug–drug interactions, shallow parsing, morphological analysis, noun phrase detection

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

When a patient is diagnosed with a new disease, the physician additionally asks the patient about allergies, previous health problems, chronic diseases, and the medications and food supplements the patient is using. After taking the gathered information into consideration and evaluating possible contraindications with the prescribed medication, the physician assigns treatment and, if needed, changes previous assignments. Almost all information about contraindications can be found in the medicine package leaflet. According to Lithuania's medicines registration procedure [1], every package must have a leaflet written in Lithuanian. Information in the leaflet must be divided into six sections [2], although the text within a section may be unstructured. So, if a physician needs to find possible contraindications, he or she must read all the text in the second section (Table I) or search for information on the Internet. Health care information usually consists of unstructured data, which leads to inaccurate search results that contain hundreds of links to irrelevant documents, and the user must read through the results to find the relevant information.

Automatic information extraction tools can extract biomedical data, save it in a structured way, and reduce the information search problem. However, automatic text analysis and information extraction from unstructured text in the medical domain is a challenging task [3]. The aim of this paper is to present a system that gives physicians a faster and more accurate way of finding contraindications by using automated contraindication detection in the medicine package leaflet.

A system that automates the extraction of contraindications from leaflet text is described in Section III. Using this system, all leaflets of medicines registered in Lithuania were analyzed. The results of this analysis (the extracted contraindications) are used in a commercial medications information system that Lithuanian physicians use for prescribing medications. The evaluation of the obtained results is presented in Section IV.

II. RELATED WORK

In Lithuania, each registered medicine must contain a package leaflet describing therapeutic indications, possible contraindications, safety precautions, and usage information in the Lithuanian language. To be sure that the patient does not suffer from a possible contraindication, the physician should read through the whole leaflet text before prescribing the medicine. The analysis of leaflets is usually time-consuming, so physicians tend to skip it and rely on the knowledge and experience they have gained.

Many systems have been developed for analysis and information extraction from biomedical text in the English language. However, there is no solution for the detection of contraindication mentions (i.e. contraindication with a disease or contraindication with a pharmacological group) in text written in Lithuanian. We have analyzed articles that describe similar problems in the analysis of biomedical text. For example, the tool Semantator [4] was created for converting biomedical text to linked data. It used ontology-based information extraction with biomedical ontology terms hosted in BioPortal and the ontology editor Protégé for text preprocessing. The semantic annotation and inference platform SENTIENT-MD [3] creates a dependency graph as the first step of dependency parsing, which is one of the tasks of semantic annotation of medical knowledge in natural language text. Markus Bundschus [5] used probabilistic graphical models (Conditional Random Fields) to identify semantic relations.

Although all these authors work on texts written in English, we found that common rules and approaches can be applied to Lithuanian texts as well. To extract information from text, preprocessing with natural language processing is needed: text segmentation and morphological analysis should be performed, and then a syntactic parse tree or a dependency graph [6], [7] should be formed. For the detection of semantic relations, existing ontologies or knowledge bases should be used.

III. SYSTEM DESCRIPTION

In this section, a system for the detection of contraindication mentions in medicine leaflet text written in Lithuanian is presented. The system implements a text analysis pipeline of four stages: extraction of the contraindication text block, morphological analysis, noun phrase detection, and annotation. Additionally, every annotated phrase is checked against the database of noun phrases to be ignored. This database is filled manually and helps to obtain more precise results. The overall pipeline for the detection of contraindication mentions is shown in Fig. 1. Each stage of text analysis is discussed in more detail below.

Fig. 1. Contraindications lookup process activity

A. Extraction of contraindication text blocks

In Lithuania, when describing a medicine, a producer must follow a certain template of the package leaflet [2]. This template splits the leaflet into the 6 sections listed in Table I.

TABLE I. MEDICINE PACKAGE LEAFLET SECTIONS

| No | Section |
|----|---------|
| 1 | What X is and what it is used for |
| 2 | What you need to know before you X |
| 3 | How to X |
| 4 | Possible side effects |
| 5 | How to store X |
| 6 | Contents of the pack and other information |

The information that the patient should be aware of before he or she takes the medicine is presented in section number two. An example of this section, with the contraindication phrases highlighted, is shown in Fig. 2. So, the first task of our system is to find this section and extract its text for further analysis.

Fig. 2. Example of “What you need to know before use of X” section in the medicine package leaflet

B. Morphological analysis

Morphological analysis forms the basis for the extraction of information about contraindications. In this stage, the given text is split into lexical units (e.g. sentences, lexemes) and analyzed morphologically. For this task, a web service provided by the system “http://semantika.lt” [9] is used. The web service returns morphological features for each given lexeme: part of speech, gender, number and so on.

C. Noun phrase detection

Phrases that express a specific contraindication are usually noun phrases, for example, heart attack, type one diabetes, pancreatitis, and so on. Therefore, we chose a phrase structure grammar method, because it fits noun phrase detection better than dependency grammar, as suggested by Axel Holvoet in his monograph [10]. Phrase structure rules are used to split a sentence written in natural language into its constituent parts: lexical and phrasal categories [10], [11], [12]. For noun phrase detection in the medicine leaflet, three phrase structure rules were specified (see Table II).

TABLE II. NOUN PHRASE STRUCTURE RULES

| No | Rule |
|----|------|
| 1 | A lexeme is a part of a noun phrase if it is a noun in the genitive case and follows another noun in the genitive case, or an adjective, or a numeral, or a participle. |
| 2 | A lexeme is a part of a noun phrase if it is an attributive adjective in the same case, number, and gender as the base noun and follows a noun in the genitive case, or an adjective, or a numeral, or a participle. |
| 3 | A lexeme is a part of a noun phrase if it is an attributive numeral in the same case, number, and gender as the base noun and follows a noun in the genitive case, or an adjective, or a numeral, or a participle. |

The algorithm implemented for noun phrase detection checks every lexeme in the sentence against the conditions of the rules presented in Table II. If the condition of at least one rule is satisfied, the lexeme is included in the noun phrase. The workflow of the analysis of the noun phrase Lėtinis reumatinis perikarditas (Chronic rheumatic pericarditis) is shown in Table III.

TABLE III. EXAMPLE OF NOUN PHRASE DETECTION WORKFLOW

| Step | Action | Rule satisfaction |
|------|--------|-------------------|
| 1 | The first lexeme Lėtinis (Chronic) is an adjective in the nominative case, singular and of masculine gender. | No rule condition is satisfied fully, but according to rule No. 2 the lexeme is a good candidate for the noun phrase. |
| 2 | The second word reumatinis is an adjective in the nominative case, singular and of masculine gender, and follows the adjective Lėtinis. | No rule condition is satisfied fully, but according to rule No. 2 the lexeme is a good candidate for the noun phrase. |
| 3 | The third word perikarditas is a noun in the nominative case, singular and of masculine gender. It follows the adjectives lėtinis and reumatinis, which are in the same case, number and gender. | The condition of rule No. 2 is satisfied. The noun is the base noun for the first two adjectives, and they are attributive adjectives of the noun, so the condition of rule No. 2 is satisfied for them as well. The analysis of the third lexeme completes the construction of the noun phrase. |

When the construction of the noun phrase is complete, the head noun of the phrase is changed to its canonical form (lemma). This is done because the names of items registered in the International Classification of Diseases (ICD) [14], the Anatomical Therapeutic Chemical Classification System (ATC) [15] and the lists of active substances are in the canonical form. Normalization is therefore required to ensure the correct comparison of values in the next stage of analysis.

D. Annotation

All noun phrases identified in the previous stage are reviewed and checked for contraindication. If a contraindication is identified, the phrase is annotated. Three databases are used for annotation: ICD, ATC and the lists of active substances. The algorithm compares the noun phrase with the name of the item from the database. If the noun phrase matches a name in ICD, the phrase is tagged as a contraindication with the disease. If the phrase matches an ATC item name, it is tagged as a contraindication with a pharmaceutical chemical group, and if the phrase matches the name of an active substance, it is tagged as a contraindication with an active substance.

It is worth mentioning that, before the comparison, all identified noun phrases are checked against the phrases in the database of noun phrases to be ignored. The text of a medicine package leaflet contains many words (e.g. illness, hand and so on) that are irrelevant (do not express a contraindication) but appear in the ICD, ATC and active substances lists. The database of noun phrases to be ignored was filled manually with the help of a professional pharmacist.

IV. EXPERIMENT

The aim of the experiment is to evaluate the created system and to check whether the tool can achieve its target: to give physicians a faster and more accurate way of finding contraindications. The experiment was done by manually annotating contraindication mentions in the package leaflet text block and comparing the results with the system's results. The manual annotation was done by a professional pharmacist who works in JSC Skaitos kompiuterių servisas.

A. Plan

The experiment was organized as follows. Ten leaflets randomly selected from the medicines database were analyzed using the created system. The results of the analysis were automatically gathered into a table, an example of which is presented in Table IV. The first column contains the code of the item that the system automatically found in the text of the leaflet. The second column contains the database (ATC, ICD or active substances) where the item is registered. The third column was used for the evaluation of annotation correctness.

TABLE IV. AUTOMATICALLY DETECTED CONTRAINDICATIONS RESULTS EVALUATION FOR SINGLE LEAFLET

| Code | Domain | Is detection correct |
|------|--------|----------------------|
| J01CR | ATC | False |
| J05AE | ATC | True |
| I09.2 | ICD | True |

The same randomly selected leaflets were analyzed and annotated manually, and a table of the same structure was filled in with the manual annotation results. Manually found contraindications were not interpreted or changed to synonyms. For example, heart attack and myocardial infarction are the same disease, but ICD contains only one name for it, myocardial infarction. The created system is not able to recognize heart attack as a synonym of myocardial infarction.

Additionally, the active substances mentioned in the leaflet were translated into Latin (nominative and genitive grammatical cases). This was done because the provided database of active substances has three versions of each name: Lithuanian, Latin in the nominative case and Latin in the genitive case.

B. Results

The results of the evaluation are presented in Table V. The precision, recall and F-Score metrics were calculated for each leaflet analyzed. Additionally, the ratio between the number of automatically detected links and the number of manually detected links was calculated. This metric helps to evaluate how accurate the results are and is used in further calculations.

TABLE V. EXPERIMENT RESULTS

| ID | Auto. detected links | Auto. correctly detected links | Man. detected links | Precision | Recall | F-Score | Ratio of links amounts | Err. links to ICD | Err. links to ATC | Err. links to active substances |
|----|------|------|------|------|------|------|------|------|------|------|
| 13092 | 1906 | 346 | 385 | 0.18 | 0.90 | 0.30 | 4.95 | 82% | 100% | 65% |
| 13571 | 1899 | 367 | 444 | 0.19 | 0.83 | 0.31 | 4.28 | 81% | 100% | 58% |
| 859 | 87 | 67 | 162 | 0.77 | 0.41 | 0.54 | 0.54 | 17% | 100% | 100% |
| 1300 | 400 | 28 | 146 | 0.07 | 0.19 | 0.10 | 2.74 | 98% | 100% | 24% |
| 10958 | 464 | 14 | 71 | 0.03 | 0.20 | 0.05 | 6.54 | 100% | 25% | 21% |
| 1872 | 283 | 66 | 68 | 0.23 | 0.97 | 0.38 | 4.16 | 77% | 100% | 43% |
| 5363 | 473 | 237 | 291 | 0.50 | 0.81 | 0.62 | 1.63 | 46% | 88% | 49% |
| 13273 | 158 | 51 | 72 | 0.32 | 0.71 | 0.44 | 2.19 | 45% | 100% | 100% |
| 10744 | 1199 | 150 | 175 | 0.13 | 0.29 | 0.18 | 6.85 | 87% | 100% | 100% |
| 16551 | 1090 | 120 | 204 | 0.11 | 0.25 | 0.15 | 5.34 | 90% | 87% | 51% |
| Median | 468.5 | 93.5 | 168.5 | 0.185 | 0.56 | 0.305 | 4.22 | 82% | 100% | 55% |
| Q1 | 312.25 | 54.75 | 90.5 | 0.115 | 0.26 | 0.158 | 2.328 | 54% | 91% | 45% |
| Q3 | 1171.75 | 215.25 | 269.25 | 0.298 | 0.825 | 0.425 | 5.243 | 89% | 100% | 91% |
| Avg | 795.9 | 144.6 | 201.8 | 0.25 | 0.56 | 0.31 | 3.92 | 72% | 90% | 61% |
| Std dev | 686.52 | 129.45 | 132.27 | 0.23 | 0.32 | 0.19 | 2.10 | 27% | 23% | 30% |
| Min | 87 | 14 | 68 | 0.03 | 0.19 | 0.05 | 0.54 | 17% | 25% | 21% |
| Max | 1906 | 367 | 444 | 0.77 | 0.97 | 0.62 | 6.85 | 100% | 100% | 100% |

The results showed that the developed system is able to correctly detect 56% of relevant contraindications. The average number of links detected automatically is 795.9, while the average number of manually detected links is 201.8 (Table V). The number of links detected automatically in one leaflet is on average four times higher than the number detected manually. The average share of erroneous links to ICD is 72%, to ATC 90%, and to the list of active substances 61%.

Calculations show that the system achieves a precision of 0.25 (±0.23), a recall of 0.56 (±0.32), and an F-Score of 0.31 (±0.19). To give a better perspective on where the system failed and the possible reasons for that, Pearson correlation coefficients between various indicators were calculated (Table VI). Incorrectly detected links to ICD had the biggest impact on the F-Score, with a coefficient of -0.89. Precision was low because of the high ratio between automatically and manually detected links.

TABLE VI. CORRELATION OF ESTIMATES AND INDICATORS

| Indicator | Precision | Recall | F-Score |
|-----------|-----------|--------|---------|
| Incorrectly detected links to ICD list amount | -0.9655 | -0.3114 | -0.8939 |
| Incorrectly detected links to ATC list amount | 0.3292 | 0.4184 | 0.4382 |
| Incorrectly detected links to active substances list amount | 0.5229 | 0.1244 | 0.4523 |
| Automatically and manually detected contraindications ratio | -0.8119 | -0.2583 | -0.7682 |

C. Conclusions of the experiment

The experiment shows that the system automatically detected more than half of the relevant contraindication links (56%). However, 75% of the detected links were erroneous, so the system lacks precision. The reason for that is the high number of incorrect links to ICD (r = -0.9655); this indicator has the most negative impact on the precision and F-Score results. This might be caused by commonly used phrases that are not contraindications but appear in the ICD list. For example, the word allergy on its own does not imply a contraindication and must be ignored. Another reason for the low estimates is the number of detected contraindication phrases. Calculations show that the higher the difference between automatically and manually detected contraindication phrases, the lower the precision and F-Score results. The reason for that is the high number of detected noun phrases that are irrelevant to contraindications, for example, pill, driving.

Additionally, the low F-Score (0.31) can be assumed to result from the low precision (0.25). To raise this indicator, the list of phrases to be ignored (common words and phrases) must be used. The most frequent reasons for the incorrect detection of contraindications are:

 the context of the phrase in the sentence is not taken into account;
 conjunctions are not taken into account, so two or more noun phrases (e.g. “…kidney and liver diseases…”) are not identified;
 brackets that are used to specify a contraindication are not taken into account (“…liver tumor (malignant or benign)…”).

To avoid errors caused by those reasons, users of the “https://gydytojams.vaistai.lt” information system will be able to mark a contraindication as erroneous, and if a pharmacist approves this, the contraindication will be removed from the database.

V. CONCLUSIONS

In this paper, a system that automatically detects contraindications and links them to existing “Skaitos kompiuterių servisas” databases has been introduced. The system analyzes the text of medication leaflets, extracts noun phrases and links them to the corresponding items in the ATC, ICD, and active substances lists. The presented system was used for the extraction of contraindications from the leaflets of all medications registered in Lithuania. The extracted data was used in a pilot project for extending the functionality of the system “https://gydytojams.vaistai.lt”. The additional function supports physicians in searching for possible contraindications that are relevant to the patient's medical records. Moreover, physicians have the possibility to give feedback about erroneous contraindications presented. In this way they help to expand the list of phrases to be ignored and to eliminate incorrect contraindication links.

The experiment shows that approximately 56% of contraindications are found, but only every fourth detected link is correct. Several changes in the algorithm remain for future work. First, before a noun phrase is looked up in the databases, its context must be identified. This would reduce the number of incorrect links. Second, phrases that refer to the analyzed medication itself must be detected and ignored.

ACKNOWLEDGMENT

Data for this system was provided by JSC Skaitos kompiuterių servisas.

REFERENCES

[1] VVKT prie LR SAM, "Įsakymas 2015 m. liepos 3 d. Nr.(1.72E)1A-755 Dėl paraiškų registruoti vaistinį preparatą, perregistruoti vaistinį preparatą, pakeisti registracijos pažymėjimo sąlygas, teisės į vaistinio preparato registraciją perleidimo, nereglamentiniam pakuotės ir (ar," 03 07 2016. [Online].
[2] European Medicines Agency, "European Medicines Agency," 02 2019. [Online].
[3] S. Sahay, E. Agichtein, B. Li, E. V. Garcia and A. Ram, "Semantic Annotation and Inference for Medical Knowledge Discovery," 2007. [Online].
[4] C. Tao, D. Song, D. Sharma and C. G. Chute, "Semantator: Semantic annotator for converting biomedical text to linked data," Journal of Biomedical Informatics, vol. 46, no. 5, pp. 882-893, Oct. 2016.
[5] M. Bundschus, M. Dejori, M. Stetter, V. Tresp and H.-P. Kriegel, "Extraction of semantic biomedical relations from text using conditional random fields," BMC Bioinformatics, vol. 9, pp. 1-14, 2008.
[6] Y. Zhang, H.-Y. Wu, J. Xu, J. Wang, S. Ergin, L. Li and H. Xu, "Leveraging syntactic and semantic graph kernels to extract pharmacokinetic drug drug interactions from biomedical literature," BMC Systems Biology, vol. 107, pp. 323-334, 8/26/2016.
[7] R. Frank, Phrase Structure Composition and Syntactic Dependencies, vol. 38, Cambridge, Mass: The MIT Press, 2002, pp. 2-27.
[8] R. Damaševičius, C. Napoli, T. Sidekerskienė and M. Woźniak, "IMF mode demixing in EMD for jitter analysis," Journal of Computational Science, vol. 22, pp. 240-252, 2017.
[9] Kaunas University of Technology and Vytautas Magnus University, "Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema," [Online].
[10] A. Holvoet, Bendrosios sintaksės pagrindai, Vilnius: Vilniaus Universitetas, Asociacija „Academia Salensis“, 2009.
[11] D. Jurafsky and J. H. Martin, "Formal Grammars of English," in Speech and Language Processing (2nd Edition), USA, Prentice-Hall, Inc., 2009, pp. 396-408.
[12] D. Šveikauskienė, "Lietuvių kalbos sintaksinė analizė," Lietuvių kalba, vol. 7, 2013.
[13] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo and E. Tramontana, "Novel approach toward medical signals classifier," in 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, July 2015, pp. 1-7.
[14] Valstybinė ligonių kasa, "TLK-10-AM / ACHI / ACS elektroninis vadovas," [Online].
[15] Norwegian Institute of Public Health, "WHOCC - Structure and principles," [Online].
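As an illustration of the noun phrase detection stage (Section III.C, Tables II and III), the following is a minimal Python sketch of one possible reading of the three rules. It is not the authors' implementation: the `Lexeme` structure, the feature codes (`"nom"`, `"gen"`, `"sg"`, `"m"`), the buffering strategy and the treatment of participles are all assumptions made for the example, and the morphological features are entered by hand instead of being requested from the semantika.lt web service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lexeme:
    """Hypothetical container for the features a morphological analyzer returns."""
    form: str         # surface form as written in the leaflet
    lemma: str        # canonical form
    pos: str          # "noun", "adj", "num", "part" or "other"
    case: str = ""    # e.g. "nom", "gen"
    number: str = ""  # e.g. "sg", "pl"
    gender: str = ""  # e.g. "m", "f"


MODIFIERS = {"adj", "num", "part"}


def agrees(modifier: Lexeme, noun: Lexeme) -> bool:
    """Same case, number and gender, as required by rules 2 and 3."""
    return (modifier.case, modifier.number, modifier.gender) == (
        noun.case, noun.number, noun.gender)


def build_phrase(candidates: List[Lexeme]) -> str:
    """Turn buffered candidates into a phrase; return '' if no noun is present."""
    head_idx = max((i for i, lex in enumerate(candidates) if lex.pos == "noun"),
                   default=-1)
    if head_idx < 0:
        return ""
    head = candidates[head_idx]
    words = [head.lemma]  # the head noun is normalized to its lemma
    nearest_noun = head
    # Walk backwards from the head and keep lexemes while a rule still holds.
    for lex in reversed(candidates[:head_idx]):
        if lex.pos == "noun" and lex.case == "gen":                      # rule 1
            nearest_noun = lex
        elif lex.pos in ("adj", "num") and agrees(lex, nearest_noun):    # rules 2, 3
            pass
        elif lex.pos == "part":  # assumption: participles are kept as-is
            pass
        else:
            break
        words.append(lex.form)
    return " ".join(reversed(words)).lower()


def detect_noun_phrases(sentence: List[Lexeme]) -> List[str]:
    phrases: List[str] = []
    buffer: List[Lexeme] = []

    def flush() -> None:
        phrase = build_phrase(buffer)
        if phrase:
            phrases.append(phrase)
        buffer.clear()

    for lex in sentence:
        if lex.pos in MODIFIERS:
            buffer.append(lex)      # candidate, waits for a base noun
        elif lex.pos == "noun":
            buffer.append(lex)
            if lex.case != "gen":   # a non-genitive noun closes the phrase
                flush()
        else:
            flush()                 # any other lexeme ends the current phrase
    flush()
    return phrases


example = [
    Lexeme("Lėtinis", "lėtinis", "adj", "nom", "sg", "m"),
    Lexeme("reumatinis", "reumatinis", "adj", "nom", "sg", "m"),
    Lexeme("perikarditas", "perikarditas", "noun", "nom", "sg", "m"),
]
print(detect_noun_phrases(example))  # ['lėtinis reumatinis perikarditas']
```

The two adjectives are only buffered as candidates, and the phrase is closed when the base noun arrives, which mirrors the three steps of Table III. Conjunctions and brackets are not handled, in line with the limitations listed in Section IV.C.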
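The annotation stage (Section III.D) can be sketched in the same spirit. The snippet below is only an illustration under stated assumptions: the three dictionaries are tiny toy stand-ins for the ICD, ATC and active substances databases, the entries are written in English for readability, the identifier `AS-0001` and the function name `annotate` are hypothetical, and exact string matching after lower-casing is assumed as the comparison method.

```python
from typing import Optional, Tuple

# Toy stand-ins for the real databases (normalized name -> code).
ICD = {"chronic rheumatic pericarditis": "I09.2"}
ATC = {"protease inhibitors": "J05AE"}
ACTIVE_SUBSTANCES = {"ibuprofen": "AS-0001"}  # hypothetical identifier

# Manually maintained list of noun phrases that must never be annotated.
IGNORED = {"illness", "hand", "allergy"}


def annotate(phrase: str) -> Optional[Tuple[str, str]]:
    """Return (tag, code) for a normalized noun phrase, or None if nothing matches."""
    key = phrase.strip().lower()
    if key in IGNORED:  # the ignore list is consulted before any lookup
        return None
    if key in ICD:
        return ("contraindication with disease", ICD[key])
    if key in ATC:
        return ("contraindication with pharmaceutical chemical group", ATC[key])
    if key in ACTIVE_SUBSTANCES:
        return ("contraindication with active substance", ACTIVE_SUBSTANCES[key])
    return None


for candidate in ["Chronic rheumatic pericarditis", "illness", "heart attack"]:
    print(candidate, "->", annotate(candidate))
```

With this toy data, "heart attack" yields no annotation because only exact names are matched, which corresponds to the synonym limitation described in Section IV.A.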
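The per-leaflet estimates in Table V can be recomputed from the three link counts. The paper does not spell out the formulas, so the sketch below assumes the standard definitions (precision = correct / automatically detected, recall = correct / manually detected, F-Score = harmonic mean of the two) and takes the ratio column to be automatically detected links divided by manually detected links. The names `LeafletCounts` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LeafletCounts:
    auto_detected: int    # links found by the system
    auto_correct: int     # system links judged correct
    manual_detected: int  # links found by the pharmacist


def evaluate(c: LeafletCounts) -> Dict[str, float]:
    """Compute the estimates for one leaflet; all counts are assumed to be non-zero."""
    precision = c.auto_correct / c.auto_detected
    recall = c.auto_correct / c.manual_detected
    f_score = 2 * precision * recall / (precision + recall)
    ratio = c.auto_detected / c.manual_detected
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "ratio": ratio}


# Counts of leaflet 13092, taken from the first row of Table V.
row = LeafletCounts(auto_detected=1906, auto_correct=346, manual_detected=385)
print({name: round(value, 2) for name, value in evaluate(row).items()})
# {'precision': 0.18, 'recall': 0.9, 'f_score': 0.3, 'ratio': 4.95}
```

Under these assumptions the output matches the values reported for leaflet 13092 (0.18, 0.90, 0.30 and 4.95).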