=Paper=
{{Paper
|id=None
|storemode=property
|title=Some Remarks on Automatic Semantic Annotation of a Medical Corpus
|pdfUrl=https://ceur-ws.org/Vol-744/paper5.pdf
|volume=Vol-744
}}
==Some Remarks on Automatic Semantic Annotation of a Medical Corpus==
Agnieszka Mykowiecka, Małgorzata Marciniak
Institute of Computer Science, Polish Academy of Sciences,
J. K. Ordona 21, 01-237 Warsaw, Poland
agn,mm@ipipan.waw.pl
Abstract. In this paper we present arguments that elaborating a rule-based information extraction system is a good starting point for obtaining a semantically annotated corpus of medical data. Our claim is supported by the evaluation results of the automatic annotation of a corpus containing hospital discharge reports of diabetic patients.
1 Introduction
Many current methods of recognizing various types of information included within natural language texts are based on statistical and machine learning approaches. Such applications need specially prepared domain data for training and testing. Clinical texts are hard to obtain because of privacy laws; in particular, none of the Polish corpora include this type of text. Corpora made available during the past decade more often contain biomedical than clinical texts (e.g. the corpora described in [1]). Recently, creating corpora containing clinical data has started to attract much more attention, e.g. the Cincinnati Pediatric Corpus (http://computationalmedicine.org/cincinnati-pediatric-corpus-available), [2], or the data collected within Informatics for Integrating Biology and the Bedside (i2b2, https://www.i2b2.org/NLP/DataSets/Main.php). This year, the Text REtrieval Conference (TREC) added the Medical Records Track, devoted to exploring methods for searching unstructured information in patient medical records. In nearly all existing resources, semantic annotation is absent or very limited. One of the few exceptions is CLEF [10], which contains cancer patient records annotated with information about clinical relations, entities, and temporal information.
There are two main approaches to the task of annotating new linguistic data: manual annotation, and manual correction of automatically assigned labels. The traditional annotation methodology consists in preparing and accepting annotation guidelines, annotating every text by at least two annotators and, finally, resolving differences by a third, experienced annotator. This approach, applied to part-of-speech annotation, is described in [9]; manual semantic annotation is described in [10] and [12]. Manual annotation is a time-consuming and expensive process; moreover, manual work is error-prone. Manually constructed data are very hard to extend and modify, as every change imposes extra effort for checking the consistency of the result. Therefore, providing automatic methods to
facilitate the task is very important. Automatic annotation is much faster and, although it does not guarantee complete correctness either, the cost of correcting already labeled data is lower than the cost of entirely manual annotation. Automatic annotation of data was applied in the MUCHMORE project [13]. The methods described in [11] can support automatic annotation of textual content with SNOMED concepts.
A good starting point for automatic annotation is provided by methods of Information Extraction (see [6]) based on regular expressions and lexicons (e.g., [3]), which do not require annotated corpora as machine learning techniques do. In this paper we discuss the results of annotating a corpus of Polish diabetic records with a set of complex semantic labels consisting of about 50 attributes. For this task we reused an already existing rule-based IE system. In section 2 we present the method used to create the annotated corpus and the methodology adopted for the evaluation process. Then, in section 3, we describe the obtained results. The paper concludes with a discussion of the evaluation results.
2 Method
2.1 Data description
The corpus consists of 460 hospital discharge reports of diabetic patients, collected from 2001 to 2006 in one of the hospitals in Warsaw. Each document is about 1.5–2.5 pages long and written in MS Word. The documents were converted into plain text files to facilitate their linguistic analysis and corpus construction. As the data include information serving identification purposes (names and addresses), these fragments were substituted with symbolic codes before the documents were made accessible for analysis. This anonymization was performed in order to make the data available for scientific purposes.
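The paper does not detail the anonymization procedure; the following minimal sketch (the identifier lexicon and code names are hypothetical illustrations, not the authors' actual tool) shows the general idea of substituting identifying strings with symbolic codes:

```python
# A hypothetical lexicon mapping identifying strings to symbolic codes;
# a real system would draw on name and address lists and document layout.
IDENTIFIERS = {
    "Jan Kowalski": "PATIENT-001",
    "ul. Przykladowa 5, Warszawa": "ADDRESS-001",
}

def anonymize(text: str, mapping: dict) -> str:
    """Replace every known identifying string with its symbolic code."""
    for original, code in mapping.items():
        text = text.replace(original, f"<{code}>")
    return text

print(anonymize("Pacjent Jan Kowalski, ul. Przykladowa 5, Warszawa",
                IDENTIFIERS))
# -> Pacjent <PATIENT-001>, <ADDRESS-001>
```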
The entire dataset contains about 1,800,000 characters in more than 450,000
tokens, out of which 55% are words, abbreviations and acronyms, while 45% are
numbers, punctuation marks and other symbols.
2.2 Automatic annotation process
In contrast to many annotated text corpora which were built by manually assigning labels to appropriate text fragments, we decided to adapt an existing IE system [8] for the task. However, after inspecting the IE system's results, it turned out that they did not contain all the information needed. For the IE system, the main goal was to find out whether a particular piece of information is present in an analyzed text, while the task of text annotation requires identifying the boundaries of the text fragments which are to be assigned a given label. To solve the problem, the idea of combining two extraction grammars was introduced. On the basis of the existing grammar, a simplified version, consisting of a subset of the original rules, was created. The final information associating text fragments with semantic labels is the effect of a comparison of the results of these correlated IE grammars. The limits of text fragments representing attribute values
are recognized in the simplified grammar, while their correctness is confirmed by the more complex grammar rules, which describe the contexts in which a particular phrase has the desired meaning. Thus, the annotation process (described in detail in [4]) consists of the following steps:
– parsing the text with the existing full extraction grammar,
– parsing the entire text using the simplified grammar,
– removing unnecessary information from the output of both grammars,
– comparing and combining the results – only structures that are represented in both results are kept in the final corpus data, together with information on the boundaries of the entire phrase and its subphrases (see the sketch after this list),
– combining the semantic information with morphological information (see [5]) to create a set of corpus XML files,
– manual correction of annotations.
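The paper does not specify the data structures involved; as a rough sketch, assume each grammar returns labeled character spans. Keeping only the structures confirmed by both grammars then reduces to an overlap check (the span representation and the overlap criterion are our assumptions):

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int   # character offset of the first token
    end: int     # character offset just past the last token
    label: str   # semantic attribute name, e.g. "hba1c"

def combine(full: list, simple: list) -> list:
    """Keep a span from the simplified grammar only if the full grammar
    found the same attribute in an overlapping region: the simplified
    grammar supplies the boundaries, the full grammar the confirmation."""
    confirmed = []
    for s in simple:
        for f in full:
            if s.label == f.label and s.start < f.end and f.start < s.end:
                confirmed.append(s)
                break
    return confirmed

full_g   = [Span(10, 60, "hba1c")]                     # wide-context match
simple_g = [Span(35, 47, "hba1c"), Span(80, 90, "hba1c")]
print(combine(full_g, simple_g))   # only the first span is confirmed
```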
2.3 Annotated data
Within the semantic annotation layer, about 50 simple attributes, 11 complex structures and 3 list types are defined. Below, they are described in the same groups as in the evaluation of the annotation given in Table 1.
– Identification of a patient's visit in hospital: the visit identification number and information whether the document is a main document or a continuation; the date of the document; the dates when the hospitalization took place.
– Patient information: a structure with the patient's identifier and sex, and simple attributes representing age, height, weight (in numbers or words) and BMI.
– Data about diabetes (in some cases grouped in a feature l str structure), e.g.: type (d type); whether the illness is balanced (d control); when diabetes was first diagnosed (expressed as an absolute or relative date); reasons for hospitalization (as a list of attributes); and results of basic tests: HbA1c, acetone, LDL, and levels of microalbuminuria and creatinine.
– Complications and other illnesses, including autoimmune and accompanying illnesses, which may be correlated with diabetes.
– Diabetes treatment, described by: insulin treat str, which contains the insulin type and its doses; a description of continuous insulin infusion therapy (ins inf treat); a description of oral medications; information that insulin therapy was started. The applied therapy is sometimes given as a list of information items, represented by a cure l str list of attributes.
– Diet description, represented by diet str, which contains information on the type of diet (diet type), a structure describing how many calories are recommended, and a similar structure representing the number of meals.
– Information on therapy given in text form, e.g.: patient's education, diet observance, therapy modification, self-monitoring.
Some of the attributes have values representing dates, e.g. the hospit structure has two substructures describing the beginning and the end of a hospital stay (h from and h to). To correctly label these attributes it is necessary to recognize the different formats of dates and the appropriate contexts indicating the meaning of a date. Dates are also recognized at the beginning of a document, and for representing the date when diabetes was first diagnosed.
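As an illustration of this step, a minimal sketch of context-dependent date recognition might look as follows (the date formats and the context words are our assumptions, not the grammar's actual rule set):

```python
import re

# A couple of numeric date formats seen in Polish clinical text
# (an assumption; the real grammar covers more variants).
DATE = r"(\d{1,2}[.\-/]\d{1,2}[.\-/]\d{2,4})"

# The surrounding context decides the meaning of a date: here, the
# 'od ... do ...' ('from ... to ...') pattern marks hospitalization dates.
HOSPIT = re.compile(r"od\s+" + DATE + r"\s+do\s+" + DATE)

m = HOSPIT.search("Pacjent hospitalizowany od 12.03.2004 do 19.03.2004")
if m:
    print("h_from =", m.group(1), "| h_to =", m.group(2))
```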
Most attributes representing the results of tests have numbers as values. They are usually attached to short phrases consisting of an introductory phrase indicating the type of a test and its value, sometimes separated by one of the following characters: '=', ':', '-'. Values can also be given in brackets. Only the results of LDL cholesterol levels need a wide context, because they are presented in table form together with other test results. This explains the average length of 27 tokens of a phrase representing lipid str, which provides the context of the ldl attribute.
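A hedged sketch of such a rule for a numeric test result (the pattern below is our illustration for creatinine, not the original grammar's notation):

```python
import re

# Introductory test name, an optional separator ('=', ':' or '-'),
# then a number, possibly in brackets; comma as the decimal separator.
CREATININE = re.compile(
    r"kreatynina\s*[=:\-]?\s*\(?(\d+(?:,\d+)?)\)?", re.IGNORECASE)

for text in ["kreatynina = 0,9", "Kreatynina: 1,1", "kreatynina (0,8)"]:
    m = CREATININE.search(text)
    print(text, "->", m.group(1) if m else None)
```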
Some attributes, having boolean values, label relatively short phrases, like the results of acetone tests. For example, a negative value is attached to the following strings: ac. (-, ac. -, ac. /-, ac. nieobecny 'absent', bez acetonurii 'without acetone in urine', ustąpiła acetonuria or ustąpienie acetonurii 'acetone in urine subsided'. Other boolean-valued attributes are represented by many different, sometimes long, phrases. For example, the information whether the therapy of diabetes was modified is represented in the test set, after correction, by 23 different phrases with an average length of 4.3 tokens.
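A boolean attribute of this kind essentially reduces to lexicon lookup; a minimal sketch using the acetone strings quoted above (the value assignment is our reading of the example):

```python
# Strings that the paper lists as carrying a negative acetone result.
ACETONE_NEGATIVE = [
    "ac. (-", "ac. -", "ac. /-", "ac. nieobecny",
    "bez acetonurii", "ustąpiła acetonuria", "ustąpienie acetonurii",
]

def acetone_value(text: str):
    """Return False if any negative-acetone phrase occurs, else None."""
    if any(phrase in text for phrase in ACETONE_NEGATIVE):
        return False
    return None  # attribute left unset; positive results need their own lexicon

print(acetone_value("badanie moczu: bez acetonurii"))  # -> False
```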
Attributes of the last group have many values of different types. For example, the complication attribute has 17 different values. It is usually attached to a short phrase (avg. 2.2 tokens) representing just the complication name. Longer phrases (avg. 5 tokens) represent the opposite information (n comp), stating that a particular complication was not diagnosed or that there are no complications. These phrases have to contain an expression like nie wykryto 'not diagnosed'.
3 Results
In the corpus consisting of 460 patient records, 66,165 occurrences of simple attributes were labeled. To check the quality of the results, manual verification of a randomly selected 10% of the corpus (46 records, 46,439 tokens) was done by two annotators, who were given the following guidelines:
– Structures should be assigned to continuous phrases, i.e. to all tokens between the first and the last token of the phrase.
– The boundaries of a phrase to which a label is assigned are determined on the basis of sets of words that may start and end the phrase.
– In the case of phrases that represent information which should be taken into account but was not predicted by the grammar designer, annotators have to rely on their own opinion as to which words belong to such a phrase. If possible, rules similar to those described in the guidelines should be applied.
– Annotators have to mark information that is understandable to human readers, so phrases with spelling errors should also be annotated.
The results of the manual corrections of the system's output made by the two annotators were then compared, and the agreed version was accepted as the Gold-standard version. The final number of differences between the automatically obtained annotation and the Gold standard concerned 596 token labels (1.3%). Human corrections mainly concerned the addition of new labels (79 structures, 554 tokens). Deletions of mistakenly recognized structures were much less frequent (4 structures, 20 tokens); very few changes concerned only the boundaries or the name of a structure. 283 corrections were proposed consistently by both annotators. The kappa coefficient for inter-annotator agreement, computed over all word-label pairs, was equal to 0.976 when empty labels were counted (46,439 occurrences in total) and 0.966 when they were ignored (9,031 occurrences). The agreement between the corrected version and the automatically annotated set was equal to 0.94. Inter-annotator agreement computed only for the beginnings of structures (3,308) was equal to 0.976.
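Agreement figures of this kind can be reproduced with a standard implementation of Cohen's kappa; a toy sketch over word-label pairs (the labels are invented, and 'O' stands for an empty label):

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by the two annotators to the same token sequence.
ann1 = ["O", "hba1c", "hba1c", "O", "d_type", "O"]
ann2 = ["O", "hba1c", "hba1c", "O", "d_type", "d_type"]

# Kappa over all word-label pairs, empty labels included.
print(cohen_kappa_score(ann1, ann2))

# Ignoring empty labels: keep only positions where at least one
# annotator assigned a non-empty label.
pairs = [(a, b) for a, b in zip(ann1, ann2) if a != "O" or b != "O"]
a, b = zip(*pairs)
print(cohen_kappa_score(a, b))
```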
The corrected results were compared with the automatically annotated data. In general, the verification of 9,057 non-empty labels showed that the automatic annotation achieved an accuracy of 0.987, a precision of 0.995, a recall of 0.936 and an F-measure of 0.966. Precision was equal to 1.00 for all attributes except doc dat and comp, and for all structures except dose str and insulin treat str. Recall and F-measure values for all attributes and structures which occurred in the evaluation set are given in Table 1. Errors can be classified into 3 groups:
– Omissions and mistakes of the system: dieta cukrzycowa wysokobiałkowa 1800 kcal, 3 posiłki 'diabetic high protein diet 1800 kcal, 3 meals' – we did not recognize a diet of the type 'diabetic and high protein'; the system did not label information on a patient's obesity when it was expressed in Latin ('obesitas') instead of Polish ('otyłość').
– Spelling or punctuation errors in the original data in words that are crucial for the rules: wlew podsttawowy instead of podstawowy 'base infusion'; pRetinopathia; masa ciała103 'weight103'.
– Information represented by phrases not predicted by the extraction grammars, or difficult to label by the system because of ambiguity (examples are discussed in section 4).
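The precision, recall, and F-measure figures quoted above follow the standard definitions over labeled structures; a minimal sketch of their computation (the (start, end, label) representation and the exact-match criterion are our assumptions):

```python
def prf(gold: set, predicted: set):
    """Precision, recall and F1 over sets of (start, end, label) triples."""
    tp = len(gold & predicted)                       # exact span-and-label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 5, "hba1c"), (10, 14, "d_type"), (20, 25, "comp")}
pred = {(0, 5, "hba1c"), (10, 14, "d_type")}
print(prf(gold, pred))  # -> (1.0, 0.666..., 0.8)
```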
As evaluations based on verifying system output can be biased towards the types of phrases which are recognized by the system, and may result in the omission of other types of phrases which represent the same information, we performed a second type of evaluation. We compared the automatically generated annotation with a manual annotation which was done without seeing the system's results. For this purpose, 5 discharge records randomly selected from the Gold-standard subcorpus were annotated manually. It took a well-trained person 250 minutes (correction of the automatic annotation took less than 1 hour), and the F-measure of the results in comparison to the Gold-standard annotation was equal to 0.86. The kappa coefficient between the manually obtained annotation and the corrected system output was equal to 0.87 when all word-label pairs were counted, and 0.82 for the beginnings of structures. The lower coefficient values were due to annotator inattention, which resulted in the omission of information or in attaching a label to an inappropriate text fragment. The agreement between the corrected version and the automatically annotated set was equal to 0.94.
Table 1. Semantic label diversity and label verification

| structure/attribute | occurr. (G-std) | occurr. (rule-based) | F-measure | recall | phrase types | avg phrase length | words total |
|---|---|---|---|---|---|---|---|
| **administrative information** | | | | | | | |
| DOC BEG | 46 | 46 | 1 | 1 | 1 | 2 | 92 |
| DOC DAT | 37 | 38 | 0.99 | 1 | 1 | 1 | 298 |
| id str | 46 | 45 | 0.99 | 0.98 | 2 | 4 | 183 |
| ID | 46 | 45 | 0.99 | 0.98 | – | – | – |
| CONT | 45 | 45 | 1 | 1 | – | – | – |
| hospit str | 46 | 43 | 0.97 | 0.93 | 7 | 18 | 831 |
| H FROM | 46 | 43 | 0.97 | 0.93 | 2 | 7 | 323 |
| H TO | 46 | 43 | 0.97 | 0.93 | 2 | 7 | 323 |
| EPIKRYZA BEG | 46 | 46 | 1 | 1 | 1 | 1 | 46 |
| recommendation str | 44 | 44 | 1 | 1 | 13 | 5.6 | 248 |
| RECOMMEND BEG | 44 | 44 | 1 | 1 | 1 | 2 | 88 |
| **basic patient data** | | | | | | | |
| id pat str | 46 | 45 | 0.99 | 1 | 11 | 4 | 191 |
| id pat sex | 46 | 45 | 0.99 | 1 | 3 | 2.2 | 100 |
| ID PAT | 46 | 45 | 0.99 | 1 | – | – | – |
| ID P SEX | 46 | 45 | 0.99 | 1 | – | – | – |
| ID AGE | 46 | 45 | 0.99 | 1 | 6 | 2 | 91 |
| W IN WORDS | 6 | 5 | 0.91 | 0.83 | 4 | 1 | 6 |
| WEIGHT | 40 | 39 | 0.99 | 0.97 | 6 | 3.5 | 138 |
| BMI | 33 | 33 | 1 | 1 | 3 | 3.6 | 119 |
| HEIGHT | 40 | 39 | 0.99 | 0.97 | 3 | 2 | 80 |
| **basic diabetes data** | | | | | | | |
| D CONTROLL | 30 | 27 | 0.95 | 0.90 | 18 | 1.8 | 55 |
| FROM IN W | 1 | 0 | – | – | 1 | 2 | 2 |
| HBA1C | 59 | 54 | 0.96 | 0.92 | 8 | 5 | 299 |
| ACET D | 42 | 42 | 1 | 1 | 4 | 2 | 85 |
| creatinin str | 43 | 41 | 0.98 | 0.95 | 7 | 4.4 | 191 |
| microalbuminury str | 13 | 12 | 0.96 | 0.92 | 6 | 5 | 65 |
| lipid str | 31 | 27 | 0.93 | 0.87 | 6 | 27 | 834 |
| LDL | 31 | 27 | 0.93 | 0.87 | 3 | 2.3 | 78 |
| feature l str | 91 | 91 | 1 | 1 | 59 | 5.7 | 518 |
| COMP | 5 | 5 | 1 | 1 | 4 | 1.6 | 8 |
| D CONTROLL | 34 | 34 | 1 | 1 | 9 | 1.2 | 40 |
| D TREAT | 24 | 24 | 1 | 1 | 6 | 2 | 49 |
| D TYPE | 70 | 70 | 1 | 1 | 2 | 2 | 139 |
| FROM IN W | 10 | 10 | 1 | 1 | 5 | 1 | 10 |
| RELATIVE DATA | 19 | 18 | 0.97 | 0.95 | 7 | 3 | 58 |
| W IN WORDS | 10 | 10 | 1 | 1 | 4 | 1 | 10 |
| reason l str | 30 | 27 | 0.95 | 0.90 | 27 | 12.3 | 370 |
| D CONTROLL | 40 | 37 | 0.85 | 0.95 | 19 | 2.3 | 94 |
| KETO D | 2 | 2 | 1 | 1 | 2 | 1 | 2 |
| KWAS D | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
| RELATIVE DATA | 1 | 1 | 1 | 1 | 1 | 4 | 4 |
| SELF MONITORING | 1 | 1 | 1 | 1 | 1 | 1 | 4 |
| **complication and acc diseases** | | | | | | | |
| ACC DISEASE | 48 | 48 | 1 | 1 | 3 | 1 | 48 |
| COMP | 134 | 132 | 0.97 | 0.96 | 49 | 2.2 | 294 |
| N COMP | 27 | 15 | 0.71 | 0.56 | 11 | 5 | 134 |
| **therapy** | | | | | | | |
| insulin treat str | 446 | 444 | 0.99 | 0.99 | 103 | 5.7 | 2531 |
| I TYPE | 439 | 436 | 0.99 | 0.99 | 23 | 1.7 | 746 |
| dose str | 441 | 440 | 0.99 | 0.99 | 8 | 3.1 | 1363 |
| corr str | 2 | 1 | 0.67 | 0.5 | 2 | 6 | 12 |
| DOSE MODIFF | 2 | 1 | 0.67 | 0.5 | 1 | 1 | 2 |
| THERAPY MODIFF | 2 | 1 | 0.67 | 0.5 | 1 | 2.5 | 5 |
| diet str | 47 | 44 | 0.97 | 0.94 | 29 | 7.8 | 366 |
| DIET TYPE | 47 | 44 | 0.97 | 0.94 | 4 | 2.1 | 100 |
| cal str | 47 | 44 | 0.97 | 0.94 | 6 | 2.8 | 131 |
| CAL MIN | 47 | 44 | 0.97 | 0.94 | – | – | – |
| meals str | 45 | 41 | 0.95 | 0.91 | 8 | 2.2 | 99 |
| MEALS MIN | 45 | 41 | 0.95 | 0.91 | – | – | – |
| ORAL TREAT | 63 | 63 | 1 | 1 | 18 | 1.2 | 75 |
| I THERAPY BEG | 4 | 1 | 0.40 | 0.25 | 4 | 5.3 | 21 |
| THERAPY MODIFF | 24 | 19 | 0.88 | 0.79 | 23 | 4.3 | 103 |
| DOSE MODIFF | 9 | 8 | 0.94 | 0.89 | 6 | 3.3 | 30 |
| DIET CORRECTION | 2 | 2 | 1 | 1 | 2 | 3 | 6 |
| SELF MONITORING | 0 | 2 | – | – | 3 | 1 | 3 |
| EDUCATION | 27 | 25 | 0.96 | 0.93 | 20 | 8 | 215 |
4 Discussion and Conclusions
Standard information given as numbers or dates is often easy to recognize automatically by any rule-based system. The vast majority of such data is labeled correctly, yet sometimes problems arise from unpredicted long phrases representing the desired information. These errors should be corrected during manual verification of the corpus.
For example, the phrase HbA1c przy przyjęciu do Kliniki wynosiło 7,8% 'the HbA1c level at the time of admission to hospital was 7.8%' contains information that is usually represented as 'HbA1c = 7,8%'. As rule-based systems are greedy, rules have to be relaxed carefully. For example, if we allow several tokens between the introductory string HbA1c and a number in the rule assigning the hba1c attribute, it may recognize another number as the value (for HbA1C 9 %, HbA1 11,3 % the value assigned would be 11.3%). It is possible to relax the extraction grammar rules by imposing restrictions on the tokens that may appear between the 'HbA1c' token and its value, e.g. a word whose base form is przyjęcie 'admission'.
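A sketch of the two rule variants (both regular expressions are our illustrations of the point, not the grammar's actual notation, and the whitelist of gap tokens is an assumption):

```python
import re

NUM = r"(\d+(?:,\d+)?)\s*%"

# Naively relaxed rule: any material may separate 'HbA1c' from the number.
greedy = re.compile(r"HbA1c?\b.*" + NUM, re.IGNORECASE)

# Restricted relaxation: only whitelisted tokens may fill the gap, e.g.
# forms of przyjecie 'admission' and a few context words (our choice).
restricted = re.compile(
    r"HbA1c\b(?:\s+(?:przyję\w+|przy|do|Kliniki|wynosiło))*\s*" + NUM,
    re.IGNORECASE)

long_form = "HbA1c przy przyjęciu do Kliniki wynosiło 7,8%"
tricky    = "HbA1C 9 %, HbA1 11,3 %"

print(greedy.search(tricky).group(1))        # 11,3 -- the wrong value
print(restricted.search(tricky).group(1))    # 9
print(restricted.search(long_form).group(1)) # 7,8
```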
The second reason for attribute omission is paraphrasing. Natural language allows us to express the same information in many ways, so it is extremely difficult to write a system that correctly recognizes all possible phrases. For instance, to recognize the reason for hospitalization in the following phrase: pacjentka z cukrzycą typu 1 została przyjęta do Kliniki z powodu chwiejnego przebiegu choroby 'a patient with diabetes type 1 was hospitalized in the Clinic because of the unstable course of the illness', it is necessary to know that the illness mentioned in the second part of the sentence refers to diabetes.
Another example that is easy for a human annotator but caused problems in automatic annotation occurs when the context of a test result is disregarded. We assume that phrases like cukrzyca typu 2 'diabetes type 2' indicate the type of the patient's diabetes. But for the following phrase, pacjent obciążony rodzinnie, mama i babcia z cukrzycą typu 2 'patient with a family burden, mother and grandmother with diabetes type 2', this is not true. Another difficult example is the phrase dawka dodatkowa 21.00 - 2j. Humalog 'additional dose 21.00 - 2j Humalog', where the string '21.00' was recognized not as a time description but as a dose.
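A context filter of this kind can be approximated by inspecting the left context for family-relation words before accepting a match; a minimal sketch (the word list and the window size are our assumptions):

```python
import re

FAMILY = {"mama", "matka", "babcia", "ojciec", "dziadek", "rodzinnie"}

D_TYPE = re.compile(r"cukrzyc\w*\s+typu\s+(\d)", re.IGNORECASE)

def patient_diabetes_type(text: str, window: int = 5):
    """Return the diabetes type unless the left context mentions family."""
    m = D_TYPE.search(text)
    if not m:
        return None
    left = text[: m.start()].split()[-window:]
    if any(w.strip(",.") in FAMILY for w in left):
        return None  # a relative's diagnosis, not the patient's
    return m.group(1)

print(patient_diabetes_type("pacjentka z cukrzycą typu 1"))      # -> '1'
print(patient_diabetes_type("mama i babcia z cukrzycą typu 2"))  # -> None
```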
The biggest problem for automatic rule-based semantic annotation stems from phrases that require a very wide context. For example, it is impossible to correctly interpret the following phrase: Wprowadzono intensywną insulinoterapię 'Intensive insulin therapy was introduced.' This phrase is a candidate for i therapy beg, indicating the introduction of insulin into a patient's therapy. Unfortunately, from this phrase alone we do not know whether the verb 'introduce' refers to 'insulin' or to the word 'intensive', a feature of the therapy. This problem can be resolved only by a human annotator (and not always), after an analysis of other information in the document. For example, if there is information on newly diagnosed diabetes or on previous oral therapy, the phrase should be labeled with the i therapy beg attribute, whereas if there is information that the patient was treated with continuous insulin infusion therapy, it should not be labeled with that attribute.
The semantic annotation of text corpora is domain and application related. As a new annotation is usually necessary for each new purpose, all methods of increasing the efficiency of the annotation procedure are very desirable. In this paper we presented the evaluation results of a corpus annotation obtained using IE grammars. The results turned out to be of a quality good enough for statistical purposes [7]. The advantage of designing an IE system instead of preparing only guidelines for manual annotation is its flexibility: the set of rules may be changed and a slightly different resource with a high degree of consistency can be produced, whilst changing a manually annotated resource is more error-prone and time-consuming.
References
1. Cohen, K.B., Fox, L., Ogren, P.V., Hunter, L.: Corpus design for biomedical natural
language processing. In: ACL-ISMB Workshop on Linking Biological Literature,
Ontologies and Databases: Mining Biological Semantics. pp. 38–45. Detroit (2005),
http://www.aclweb.org/anthology/W/W05/W05-1306
2. Dalianis, H., Hassel, M., Velupillai, S.: The Stockholm EPR corpus – characteristics and some initial findings. In: Proceedings of the 14th International Symposium for Health Information Management Research. pp. 14–16 (2009)
3. Gold, S., Elhadad, N., Zhu, X., Cimino, J.J., Hripcsak, G.: Extracting Structured Medication Event Information from Discharge Summaries. In: AMIA Annual Symposium Proceedings. pp. 237–241 (2008)
4. Marciniak, M., Mykowiecka, A.: Construction of a medical corpus based on infor-
mation extraction results. Control and Cybernetics (in print) (2011)
5. Marciniak, M., Mykowiecka, A.: Towards Morphologically Annotated Corpus of
Hospital Discharge Reports in Polish. In: Proceedings of BioNLP 2011 (2011)
6. Meystre, S.M., Savova, G.K., Kipper-Schuler, K.C., Hurdle, J.F.: Extracting infor-
mation from textual documents in the electronic health record: A review of recent
research. IMIA Yearbook 2008: Access to Health Information pp. 128–144 (2008)
7. Mykowiecka, A., Marciniak, M.: Automatic semantic labeling of medical texts with
feature structures. In: Text, Speech and Dialogue. Proceedings of the TSD 2011,
Plzen, Czech Republic, 2011, LNAI, Springer (2011, accepted for publication)
8. Mykowiecka, A., Marciniak, M., Kupść, A.: Rule-based information extraction from
patients’ clinical data. Journal of Biomedical Informatics 42, 923–936 (2009)
9. Pakhomov, S.V., Coden, A., Chute, C.G.: Developing a corpus of clinical notes manually annotated for part-of-speech. International Journal of Medical Informatics 75, 418–429 (2006)
10. Roberts, A., Gaizauskas, R., Hepple, M., Demetriou, G., Guo, Y., Roberts, I.,
Setzer, A.: Building a semantically annotated corpus of clinical texts. Journal of
Biomedical Informatics 42(5), 950–966 (2009)
11. Ruch, P., Gobeill, J., Lovis, C., Geissbühler, A.: Automatic medical encoding with SNOMED categories. BMC Medical Informatics and Decision Making 8 (2008)
12. South, B.R., Jones, M., Garvin, J., Samore, M.H., Chapman, W.W., Gundlapalli,
A.V.: Developing a manually annotated clinical document corpus to identify pheno-
typic information for inflammatory bowel disease. BMC Bioinformatics, 10 (2009)
13. Vintar, S., Buitelaar, P., Ripplinger, B., Sacaleanu, B., Raileanu, D., Prescher, D.: An efficient and flexible format for linguistic and semantic annotation. In: Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas (2002)