=Paper=
{{Paper
|id=Vol-1749/paper4
|storemode=property
|title=Relation Mining from Clinical Records
|pdfUrl=https://ceur-ws.org/Vol-1749/paper4.pdf
|volume=Vol-1749
|authors=Anita Alicante,Anna Corazza,Francesco Isgrò,Stefano Silvestri
|dblpUrl=https://dblp.org/rec/conf/clic-it/AlicanteCIS16
}}
==Relation Mining from Clinical Records==
<pdf width="1500px">https://ceur-ws.org/Vol-1749/paper4.pdf</pdf>
<pre>
                           Relation mining from clinical records
                     Anita Alicante, Anna Corazza, Francesco Isgrò
         Department of Electrical Engineering and Information Technologies (DIETI)
                               Università di Napoli Federico II
                             via Claudio 21, 80125 Napoli, Italy
        {anita.alicante|anna.corazza|francesco.isgro}@unina.it

                                           Stefano Silvestri
             Institute for High Performance Computing and Networking, ICAR-CNR
                              via P. Castellino, 111, 80131 Napoli, Italy
                             stefano.silvestri@icar.cnr.it

Abstract

English. We propose a system to extract entities and relations from a set of clinical
records in Italian based on two preceding works (Alicante et al., 2016b) and (Alicante
et al., 2016a). This approach does not require annotated data and is based on existing
domain lexical resources and unsupervised machine learning techniques.

Italiano. Proponiamo un sistema per estrarre entità e relazioni da un insieme di cartelle
cliniche in Italiano basato su due precedenti lavori (Alicante et al., 2016b) e (Alicante
et al., 2016a). Questo approccio non richiede dati annotati e si basa su risorse lessicali
di dominio già esistenti e tecniche di apprendimento automatico senza supervisione.

1   Introduction

The digitization of medical documents in hospitals has produced plenty of information
which should be adequately organized. While part of the material, mainly including
international scientific publications, is in English, increasingly more material is being
created in the language of the country of the medical institution. The main part of the
local language material is represented by patient records. They contain important
information not only for preparing care plans or solving problems for the particular
patient, but also for extracting statistics useful for research and for logistics
administration.

Automatic processing of such repositories still cannot be straightforwardly applied. One
of the principal issues to be solved is the automatic extraction of relevant information,
usually consisting in entities and relations connecting them (Alicante et al., 2016b). In
the cited work, we extensively discuss a domain entity and relation recognition system for
Italian. Such a step is at the basis of more sophisticated analyses, including
semantics-based indexing of documents for improved retrieval, advanced query-based
information extraction, and the application of ontology-based strategies for privacy
protection.

General tools, such as TextPro (Pianta et al., 2008), are not adapted for technical
domains such as the medical one, as they are trained on generic documents, rather than
domain-specific ones. Furthermore, a lot of tools are available for English and only a few
of them have been ported to Italian. Another problem to take into account is the
occurrence, in clinical records, of typos and nonstandard abbreviations, in addition to
the most usual acronyms. Last but not least, passing from text to knowledge processing
raises tricky privacy problems. In fact, especially but not only in small hospitals,
obscuring the patient names is not sufficient to hide their identity, as the medical
information reported in records is often sufficient to reconstruct a precise profile of
the patients.

Therefore, ad hoc solutions represent the only way to build effective applications to
solve this kind of problem. For example, not only can domain entities and relations help
identify potentially dangerous information, but ontological information can also be
exploited to better protect patient privacy (Bonatti and Sauro, 2013). Again, ontology
construction and population are based on entity and relation extraction.

Efforts to port systems to languages different from English require, first of all, the
development of lexical resources for the considered language. However, they are not
sufficient, because of the intrinsic differences between languages. A widely
adopted way to tackle such difficulties is represented by machine learning approaches.

Although supervised approaches are usually more effective, they require large corpora of
annotated data, which are quite expensive to obtain, as they require that domain experts
invest time in a long and tedious annotation activity. In the medical domain, staff should
invest part of their precious time to annotate data with information about the presence
and the type of domain relevant entities and relations in records to be used for the
training phase. Things would be much easier if domain experts were only required to check
an automatically produced annotation. We therefore propose to integrate a knowledge-based
approach and a text mining approach to develop an application which requires the expert
intervention only to check the medical and pharmaceutical labels associated to groups of
relations.

More in detail, we propose here to integrate the systems discussed in (Alicante et al.,
2016b) and in (Alicante et al., 2016a): the former adopts domain dependent lexical
resources to extract entities and unsupervised machine learning approaches to decide where
relations occur in the text. The latter clusters and labels the extracted relations with
an approach based on lexical semantics.

The paper is organized with Section 2 detailing the approach implementation and Section 3
for conclusions and future work.

2   Proposed approach

The proposed framework is composed of three modules, and its logical structure is depicted
in Figure 1. The first one is devoted to domain entity (i.e., medical and pharmaceutical
entities) identification and classification, and exploits domain related lexical resources
and standard natural language tools. The second one is based on an unsupervised machine
learning approach, namely clustering, to avoid the necessity of annotating data for the
relation extraction. A potential relation is hypothesized among all pairs of the entities
identified in the preceding phase. Clustering is then applied to group similar entity
pairs. Small clusters indicate the lack of repetitive patterns and will therefore be
considered as entity pairs which are not in relation to each other, while larger clusters
are likely to correspond to different relation types.

Relations are clustered and labeled using the approach proposed in (Alicante et al.,
2016a). The decision about how a relation can be labeled is only based on the terms
involved in the corresponding entity pair, without considering the context in which it
occurs. In fact, this is complementary with respect to the task of deciding whether two
entities are related, which should be decided on the basis of the context where the two
entities occur, as in (Alicante et al., 2016b). On the other hand, by considering only the
two involved entities, we can only decide the type of a relation. Then, to decide whether
the relation is stated or negated, the context should also be considered in the analysis.

The third module of the framework is based on Word Embeddings (WEs) (Mikolov et al., 2013)
to represent the words involved in each entity with a real valued array. The most
interesting characteristic of WEs is that the mutual position of words in a metric space
strongly depends on their meanings, so that words having similar semantics have large
similarity when this is computed, for example, by cosine similarity. Embeddings can be
automatically built from a large collection of unannotated text with a very efficient
algorithm. Therefore, they can be easily applied to any language, in our case to Italian,
provided that enough texts are available. We used documents extracted from Wikipedia for
training. In particular, we considered pages flagged as Medicine, Biology and Pharmacy in
Italian. For the extraction, we used CatScan v3.0 [1], Wikipedia Export tool [2] and
Wikiextractor [3].

  [1] https://tools.wmflabs.org/catscan2/catscan2.php
  [2] https://en.wikipedia.org/wiki/Special:Export
  [3] medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py

For each entity, we then consider the embeddings corresponding to each token. As shown in
(Paperno and Baroni, 2016), a good representation for a string of words is given by the
sum of the corresponding WEs. However, as we do not want this representation to depend on
the string length, we normalize the sum by the number of words involved in the entity,
obtaining the average or centroid of the corresponding WEs. Each pair of entities
occurring in the same sentence represents a possible candidate for a relation. We
therefore build the feature vector for each entity pair by juxtaposing the average vectors
for each
entity and input this representation into a k-means clustering (Manning et al., 2008;
Shalev-Shwartz and Ben-David, 2014).

[Figure 1 panels. Input: Electronic Italian Medical Records. Entity Extraction (TextPro):
Tokenizer, PoS Tagger, Lemmatizer, Medical Entities (UMLS - PRB). Relation Extraction:
Barrier Feature Extraction, N-grams Extraction, Pair Identification, Clustering (K-means).
Relation Clustering: Word Embedding Feature Extraction, Word Embedding Feature Dictionary,
Clustering (K-means). Cluster Labeling: Labelling.]

Figure 1: Architecture for Relation mining from clinical records.

2.1   Input Preprocessing

The text processed by our system is extracted from anonymized medical records, in the form
of plain text encoded in UTF-8. The text includes a small set of special characters, used
as delimiters and/or formatters. The largest part of these medical records has been
produced by an HL7-compatible information system. At the end of each medical record, there
is often an ICD9M (International Standard for Encoding and Classifying Diseases) disease
code, which we disregard together with the rest of the structured part of the records.

The text is initially preprocessed to extract the textual parts from the medical records
and to get rid of non-textual characters. The plain text produced by this preprocessing
step is passed to the natural language processing suite TextPro to perform tokenization,
sentence splitting, PoS tagging and lemmatization.

2.2   Entity Extraction

Entity extraction is crucial for our analysis, and a specific module has been implemented
with the goal of extracting entities which are relevant for the application domain:
biomedical and pharmaceutical entities in our case. The module follows a pattern matching
approach by identifying each occurrence of a number of PoS patterns in the input text as a
candidate to be further analysed.

Afterwards, for each token occurring in the identified pattern, we search for matches of
the corresponding lemma in the dictionaries. In case of multi-word expressions, when
several patterns apply to overlapping strings of tokens, we apply a greedy approach by
choosing the longest one matching the input.

The output is produced following the TextPro format, that is, a line for each token and a
column for each analysis level. In our system these files are enriched by the information
about Medical and Pharmaceutical entities obtained from the dictionaries provided by
UMLS [4] and PRB [5]. This information is labeled as MED for the medical entities, and FAR
for the pharmaceutical ones (the whole entity tag list is shown in Table 1).

  [4] Unified Medical Language System, http://www.nlm.nih.gov/research/umls
  [5] Pharmaceutical Reference Book, officially maintained by Agenzia Italiana del Farmaco

Table 1: List of medical sub-categories

  Description                             Label
  Medical                                 MED
  Pharmaceutical                          FAR
  Anatomy                                 ANA
  Organisms                               ORG
  Diseases                                MAL
  Chemicals and Drugs                     CHE
  Technical medical equipment             TEC
  Psychology and Psychiatric              PSI
  Biology                                 BIO
  Natural Sciences                        NAT
  Anthropology and Social Science         SOC
  Technology, Industry and Agriculture    IND
  Humanities                              UMA
  Computer Science                        INF
  Groups of People                        GRU
  Health care                             ASS
  Characteristics of Publication          PUB
  Locations                               LOC

In addition to a label indicating whether the entity is medical (MED) or pharmaceutical
(FAR), we also add to each medical entity annotation the sub-categories included in the
UMLS database in correspondence to the dictionary entry. The list of sub-category labels
is summarized in Table 1. A side-effect of such sub-categorization is that the number of
potential relations increases while it becomes possible to find more specific relations.
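
The greedy longest-match lookup of Section 2.2 can be illustrated with a minimal sketch.
It is not the system's actual code: the dictionary below is a hypothetical toy stand-in
for the UMLS and PRB resources, the function name and the max_len parameter are
assumptions made for illustration, and the PoS-pattern filtering step is left out, so
every token position is treated as a candidate start.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: tuples of lemmas mapped to an entity label.
# The real system looks lemmas up in UMLS and PRB; these entries are invented.
TOY_DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("conato", "di", "vomito"): "MED",
    ("vomito",): "MED",
    ("paracetamolo",): "FAR",
}


def longest_match_entities(
    lemmas: List[str],
    dictionary: Dict[Tuple[str, ...], str],
    max_len: int = 5,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans, keeping the longest match at each position."""
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(lemmas):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + length])
            if key in dictionary:
                match = (i, i + length, dictionary[key])
                break
        if match is not None:
            spans.append(match)
            i = match[1]  # jump past the matched tokens so spans never overlap
        else:
            i += 1
    return spans
```

With the toy dictionary, the lemma sequence "conato di vomito" is tagged as one MED entity
instead of tagging "vomito" alone, which is the behaviour the greedy choice is meant to
give for overlapping multi-word expressions.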
2.3   Relation Clustering

We apply the k-means approach that identifies groups of relations of the same type
appearing in the data set. Each pair of entities occurring in the same sentence identifies
a potential relation, therefore all possible entity pairs must be considered. We then
apply a clustering algorithm to the set of all the potential relations identified. We will
disregard all entity pairs belonging to clusters having a size smaller than a given
threshold.

We then concentrate on the remaining entity pairs, which are likely to represent actual
relations, and semantically cluster them. The approach proposed for this is structured in
three main modules: Feature Construction, Clustering, and Cluster Labeling. The first
module builds a feature vector based on WEs for each relation candidate; for doing this,
it first constructs a WE dictionary by using a large collection of unannotated texts, in
our case extracted from Wikipedia. This module is based on word2vec [6] (Mikolov et al.,
2013). For the feature vectors length we chose 500, which is the default choice, and set
the minimum word count to 3, to exclude the less frequent words from the dictionary,
obtaining a set of 260,680 vectors.

  [6] The software is freely available at https://code.google.com/p/word2vec/

After that, the k-means clustering is applied to the set of feature vectors obtained by
the first module. For every entity pair we then construct a Feature Vector (FV) starting
from the WE of each word involved. Each entity can be composed of one or more words, as
for example conati di vomito (retching): in this case, for each entity, we take the
average among the WEs of the words composing the entity associated to the entity pair.
Finally, we concatenate the FVs of the two entities, obtaining a FV of 1,000 entries.

The clustering algorithm is then applied to the FV data set by means of the C Clustering
library (de Hoon et al., 2004), a fast C implementation of the k-means algorithm. As the
k-means is characterized by a random initial choice of the seeds, we repeated each run 10
times, always choosing the best solution. We considered the cosine similarity, choosing a
number of clusters equal to 40, which seemed a reasonable choice given the results from
the experiments in (Alicante et al., 2016b) and in (Alicante et al., 2016a).

Finally, to label each cluster we ordered the pairs in each cluster according to their
cosine similarity to the cluster centroid: the first four pairs are then chosen to
characterize the cluster.

As discussed above, each FV can be partitioned in two parts: the first half corresponds to
the first entity in the pair, the second one to the other. Such partition is consistently
maintained during the whole processing. Also in the computation of centroids in the
k-means clustering algorithm, the former half of each centroid derives from the average of
the former half of the involved FVs and then corresponds to the first entity.
Correspondingly, the latter half of each centroid vector only depends on the second entity
of each involved pair.

The choice of the cluster to which a given item is assigned is based on the cosine
similarity. Its computation can be divided in three parts: the dot product of the parts of
the two FVs corresponding to the first entity, the same for the second entity, and finally
the normalization with respect to the whole FV. Therefore, the evaluation of the cosine
similarity is based on a trade-off between how similar the first entities are and how
similar the second entities are in each pair. In other words, the pairs chosen to
characterize a cluster are actual entity pairs which are similar to the (abstract) cluster
representative, corresponding to the centroid.

3   Conclusions and future work

In this paper we presented a system for the extraction of information from clinical
records in Italian. A first part of the system aims to extract domain relevant entities
from medical reports by a pattern matching approach. A second part takes the output of the
former step and applies a clustering approach to explore possible relations between such
entities. A third part is based on WEs and aims to give cues about the type of the
relations.

Interestingly, the approach does not require annotated data, but only easily available
data such as Wikipedia and off-the-shelf tools in addition to the documents to process.
Naturally, available tools have been trained on annotated data, but without any adaptation
to the specific domain. It would therefore be interesting to port it to a new language,
possibly different from English, which represents the most widely studied among all
languages.

Acknowledgments

The research presented in this paper was partially supported by the national projects
CHIS - Cultural Heritage Information System (PON), and BIG4H - Big Data Analytics for
E-Health Applications (POR).
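
As a supplementary illustration of the feature construction and labeling steps of Section
2.3, the sketch below builds a pair feature vector by averaging and concatenating word
embeddings, shows that the cosine similarity splits into one dot product per entity half
plus a normalization over the whole vector, and ranks pairs by similarity to a centroid.
It is a pure-Python sketch under stated assumptions, not the system's code (which relies
on word2vec and the C Clustering library): all function names are hypothetical, and the
embeddings passed in are expected to be a plain word-to-vector mapping.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Component-wise average of equal-length word embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pair_feature_vector(
    entity_a: Sequence[str],
    entity_b: Sequence[str],
    embeddings: Dict[str, Vector],
) -> Vector:
    """Average each entity's word embeddings, then concatenate the two averages."""
    avg_a = centroid([embeddings[word] for word in entity_a])
    avg_b = centroid([embeddings[word] for word in entity_b])
    return avg_a + avg_b  # list concatenation: first half = entity A, second = entity B


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def cosine_by_halves(u: Vector, v: Vector) -> float:
    """Same value as cosine(), computed as two per-entity dot products plus a
    normalization over the whole feature vector."""
    half = len(u) // 2
    first_entity_part = dot(u[:half], v[:half])
    second_entity_part = dot(u[half:], v[half:])
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return (first_entity_part + second_entity_part) / norm


def top_pairs(fvs: Dict[str, Vector], center: Vector, k: int = 4) -> List[str]:
    """Names of the k pairs most similar to the cluster centroid (k = 4 in the paper)."""
    ranked = sorted(fvs, key=lambda name: cosine(fvs[name], center), reverse=True)
    return ranked[:k]
```

Because the numerator is a plain sum of the two per-entity dot products, a pair can score
well against a centroid through a strong match on either entity, which is the trade-off
described in the text.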


References
Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016a. Semantic
  cluster labeling for medical relations. In Proceedings of Innovation in Medicine and
  Healthcare 2016, pages 183–193, Puerto de la Cruz, Tenerife, Spain. Springer.

Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016b. Unsupervised
  entity and relation extraction from clinical records in Italian. Computers in Biology
  and Medicine, 72:263–275.

Piero A. Bonatti and Luigi Sauro. 2013. A confidentiality model for ontologies. In Harith
  Alani, Lalana Kagal, Achille Fokoue, Paul T. Groth, Chris Biemann, Josiane Xavier
  Parreira, Lora Aroyo, Natasha F. Noy, Chris Welty, and Krzysztof Janowicz, editors,
  International Semantic Web Conference (1), volume 8218 of Lecture Notes in Computer
  Science, pages 17–32. Springer.

Michiel J.L. de Hoon, Seiya Imoto, John Nolan, and Satoru Miyano. 2004. Open source
  clustering software. Bioinformatics, 20(9):1453–1454.

C.D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval.
  Cambridge University Press.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of
  word representations in vector space. Proc. of the International Conference on Learning
  Representations (ICLR 2013), pages 1–12.

Denis Paperno and Marco Baroni. 2016. When the Whole is Less than the Sum of its Parts:
  How Composition Affects PMI Values in Distributional Semantic Vectors. Computational
  Linguistics, 42(2):345–350.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro Tool Suite. In
  Proceedings of the Sixth International Conference on Language Resources and Evaluation
  (LREC'08), pages 28–30, Marrakech, Morocco. European Language Resources Association
  (ELRA).

Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory
  to Algorithms. Cambridge University Press, New York, NY, USA.

</pre>