=Paper= {{Paper |id=None |storemode=property |title=A New Method to Retrieve, Cluster And Annotate Clinical Literature Related To Electronic Health Records |pdfUrl=https://ceur-ws.org/Vol-744/paper3.pdf |volume=Vol-744 }} ==A New Method to Retrieve, Cluster And Annotate Clinical Literature Related To Electronic Health Records== https://ceur-ws.org/Vol-744/paper3.pdf

A New Method to Retrieve, Cluster And
Annotate Clinical Literature Related To
Electronic Health Records

Izaskun Fernandez1 , Ana Jimenez-Castellanos2 , Xabier Garcı́a de Kortazar1 ,
and David Perez-Rey2
1
Tekniker-IK4
{ifernandez,xkortazar}@tekniker.es
2
Dept. Int. Artificial, Facultad de Informática, Universidad Politécnica de Madrid
{ajimenez,dperez}@infomed.dia.fi.upm.es

Abstract. The access to medical literature collections such as PubMed,
MedScape or Cochrane has been increased notably in the last years by
the web-based tools that provide instant access to the information. How-
ever, more sophisticated methodologies are needed to exploit efficiently
all that information. The lack of advanced search methods in clinical
domain produce that even using well-defined questions for a particular
disease, clinicians receive too many results. Since no information analysis
is applied afterwards, some relevant results which are not presented in
the top of the resultant collection could be ignored by the expert causing
an important loose of information. In this work we present a new method
to improve scientific article search using patient information for query
generation. Using federated search strategy, it is able to simultaneously
search in different resources and present a unique relevant literature col-
lection. And applying NLP techniques it presents semantically similar
publications together, facilitating the identification of relevant informa-
tion to clinicians. This method aims to be the foundation of a collabora-
tive environment for sharing clinical knowledge related to patients and
scientific publications.

Keywords: Electronic Health Record, search engines, literature retrieval,
integration, federated search, collaborative environment

1 Introduction
Web technologies have produced an explosion in the production and availability
of clinical publications. Access to massive amounts of information by the prac-
titioners, has led to include literature search tools to support patient diagnosis
and treatment. However, clinical publications are focused on population groups
rather than specific patients. So in order to find relevant publications that fit
with specific characteristics of a particular patient, clinicians must generalize
the characteristics of the Electronic Health Record (EHR) to meet the generic
patterns of the relevant literature.

19
2
Clinical Fernandez
Literature et. al.To Electronic Health Records
Related

Clinicians accessing PubMed , MedScape or Cochrane for searching publi-
cations related to patients frequently get too many results which they should
revise in order to recover relevant information. In addition, biomedical informa-
tion is nowadays distributed across different repositories, so they have to execute
multiple queries in the distributed resources to retrieve publications regarding
a specific patient and pathology. In other areas, meta-search engines and other
tools have been widely used facilitating the integration of information from mul-
tiple sources. However, in clinical practice, there are few trials exploiting the
potential of such cutting-edge technologies.
In this work we propose a method aiming to provide a robust environment for
searching literature in the daily clinical practice of the physicians. The proposed
method is based on previous works regarding EHR-based literature retrieval [1],
improving the integration of searching results and extending its functionality
with publications clustering and annotation strategies.
The article is structured as follows: the next section presents related works; in
section 3 the method for retrieving and analysing biomedical literature based on
EHRs and clusterization techniques is described; section 4 presents the obtained
results and evaluation within the Tratamiento 2.0 project environment; to finish,
conclusions and future works are explained.

2 Background

In recent years there has been an increasing interest in extracting and enriching
the content of EHR standards ([2], [3]) such as: Health Level 7 Clinical Docu-
ment Architecture (HL7-CDA) [4] and openEHR [5]. Since the majority of the
content of EHR and clinical data is free text instead of structured information,
Natural Language Processing techniques are frequently required. For instance,
the method presented in [7] is based on GATE (General Architecture for Text
Engineering) and the UMLS standard vocabulary to extract specific information
from pathology reports.
Previous works also exist on discovering biomedical relationships from se-
mantic annotations [8] [9], but they are limited to present just PubMed abstract
collections as result. In [1], the authors presented a system to support clinicians
in the literature search process using EHR information. For literature retrieval,
the federated search methodology [10] was used for integrating results obtained
from different information sources, pre-selected by the user. Search engines like
Sphinx [6] help in the development of this kind of systems, indexing the infor-
mation and providing a fast information retrieval.
In information retrieval systems, the query formulation is a crucial step to
obtain suitable results. The EBM (Evidence-Based Medicine Working Group
1992) recommends to create clinical questions using the PICO frame. Improved
results were obtained using the PICO query format [11] [12] but, the creation of
this queries is not a trivial task as it is shown in [13].

20
Clinical
Clinical Literature Related LiteratureHealth
To Electronic Related To Electronic Health Records
Records 3

The literature in the area describes numerous efforts on both, extracting
information from EHR and improving accuracy and efficiency of clinical searches,
but new tools are required to provide support to clinicians.

3 Method
We propose a method with a modular architecture to address the requirements
of an EHR-based literature process. The main modules are: a query generation
module, a federated-search module, an automatic literature processing process
and finally a publication annotation module. The architecture is graphically
presented at Fig.1

Fig. 1. Method’s architecture

The EHR-Based Query Generation module deals with electronic health record
information, extracting the most relevant features regarding specific pathologies.
These characteristics are used to give context to the searching criteria specified
by the expert for construct the final query, which will be used in the searching
phase to query different repositories in the second module. The method man-
ages every independent result set obtained from each repository, presenting the
result as a unique and integral literature collection, semantically annotated and
clustered in the third module. Within the fourth module, clinicians may also
annotate and comment the publications of the collection.

3.1 EHR-Based Query Generation
The aim of the first module is to support the clinician on literature query gen-
eration for a certain criterion regarding an individual patient and disease, for-
mulating the query based on PICO model. The method combines both the rel-
evant information of the individual patient for the particular disease and the
theme about the clinician wants to search about. A PICO query is composed
by four main elements, P,I,C and O: P represents the problem/population; I
intervention’s information; C the comparison; and finally O the outcome. When
constructing PICO queries not always all the elements are necessary.
Let’s assume that a clinician wants to search about if there is any contraindi-
cation to prescribe ECA inhibitors (described as current criterion) for Patient A
that is a chronic diabetic patient (previously mentioned as pathology). In PICO

21
4
Clinical Fernandez
Literature et. al.To Electronic Health Records
Related

format, the P element should be represented by the most relevant characteristics
of Patient A about his/her chronic diabetes; the I element by ECA inhibitors
drug; and the O element by contraindication, which is the outcome the clinician
wants to know about. For this query, since no comparison criterion is used, the
C element would be empty.
Since the aim of this module is to automatize the process of extracting and
assigning values to PICO components, it applies different strategies to the input
parameters. For P definition, the module uses a a web service that extracts the
relevant characteristics from the input EHR about a certain pathology. As it is
detailed in [1], the web service uses a knowledge base (KB) which defines the
relevant parameters and their values. And applying the transformation rules in
the KB it translated the structured or numeric values into concepts for P, such
as high glucose.
For completing I, C and O components in a efficient way, minimizing the
non relevant information expressed on the clinician’s expressions concerning to
the current practice to search about, the module applies a NLP process. A NLP
process which takes as input a Spanish or English free text and extracts the most
relevant terminology. For that purpose it uses a GATE3 application which com-
bines a tokenizer, sentence splitter, part of speech tagger, stemmer and finally
an ontology based gazetteer tagger, and tags all the expressions in the text re-
ferring to any term in the ontology. The stemmer application is a crucial step in
order to identify not only the perfect matching expressions but also the inflected
mentions of the ontology vocabulary, such as potassium ion levels, the pluralized
form of potassium ion level ontology concept. The ontology for the gazetteer tag-
ger is parametrizable, which makes the module portable for different pathologies
and domains.
The EHR-Based Query Generation interacts with the clinician, presenting
the automatically extracted terminology and he/she must select the most rele-
vant ones for each PICO field.

3.2 Federated Search

Federated search module allows searching different repositories simultaneously
with the same query. In this module clinicians can not only select the desired
resources from a predefined list, but also they can weight the selected resources
establishing preferences among them. The influence of these preferences in results
will be sown at 3.3 section.
The query defined in the previous module is used to search every selected
source. Since each resource should manage the information in a different way, a
configuration file is associated to each source defining the way to ask and how
to get results from it. Concretely, the configuration file specifies how the PICO
query should be transformed for asking the current resource and also, how to
3
General Architecture for Text Engineering is a open source software capable of solv-
ing almost any text processing problem. More information at http://gate.ac.uk/

22
Clinical
Clinical Literature Related LiteratureHealth
To Electronic Related To Electronic Health Records
Records 5

interpret the searching output. The configurations files are implemented using
xslt, a declarative language designed for defining XML file transformations.
Using xslt files for specifying source dependent characteristic instead of in-
cluding them in the module code, makes it flexible. Flexible in the sense that
defining a xslt configuration file is enough for including a new source in the
module.
So the module accesses to the configuration files of the selected sources trans-
forming the PICO query in the corresponding formats, asks the sources and
again, using the configuration files, it interprets each output and stores and
merges all the publications result sets in the server database as a unique col-
lection. For each publication this module gets the title (required), and a short
description, authors and the publication date when they are available. Since some
sources can share literature, the method eliminates duplicated publications using
title information.

3.3 Literature Processing

Literature collections without duplications are used in the following steps to
present not only a list of relevant publications, but also an enriched set of scien-
tific literature. In the server database for each publication among others, there
is available at least the title and a short description if it is extracted at searching
phase. Applying to that literature content the NLP terminology extractor pro-
cess described at Section 3.1, this module obtains the relevant terminology on
each publication. Since each publication is annotated separately, the method can
access to the terminology of each item on the collection and present the entire
collection as: (i) a content based relevance ranked list, (ii) a list of publications
ordered by both content and source relevance or (iii) a semantically clustered
collection.
In the first two options, using Sphinx index engine4 the method compares
the extracted literature terminology with the query terminology. It ranks the
publications measuring the shared terminology in both, query and publication,
and it adjusts the measure with the resource weight for the (ii) modality. This
way in (ii) option publications are ordered by their the relevance respect to the
query, but also by the clinician resource preferences using the belonging resource
weights defined in the federated search module (described at 3.2 section). In the
last option, a clustering strategy is applied to relate and group publications
based on their semantic similarity. The terminology of different publications is
compared, clustering the publications that shares similar terms. This way the
method represent semantically similar publications grouped together.

3.4 Publication Annotation

Finally the proposed method allows the clinicians to annotate the results with
two main functionalities: creating notes to remark any commentary about a
4
http://sphinxsearch.com/

23
6
Clinical Fernandez
Literature et. al.To Electronic Health Records
Related

certain publication; and voting each publication relevance respect to the current
query with like/dislike notations.
Actually all the notes are stored in the server database, in order to give
the opportunity to clinicians to refer their already revised collections without
repeating the entire process and showing the annotations they did in the past.

4 Method Validation

Two use cases have been developed within the environment of the Tratamiento
2.0 5 project to test the proposed method: diabetes and arterial hypertension.
For each pathology, we have implemented a knowledge base with the relevant
parameters. Based on these parameters, the service automatically extracts from
EHR the characteristic of current patient, and translates it to the defined vocab-
ulary. Concretely we have integrated English and Spanish version of the MeSH
ontology at the NLP process for this validation. An example rule from the im-
plemented rule-set is the one treating the systolic pressure (Sp) value: if Sp <=
90: then ”systolic hypotension”.
A preliminary test set of ten patient data and fifty queries has been used to
check the functionality of the method. The system has extracted correctly the
characteristic for all the patients and it has correctly identified the 90% of the
relevant papers, according to assessment of experts of the project.
In the context of the project OSI+ 6 , a user interface has been developed to
exploit the modules defined above. With patient health record’s characteristics
and the current query, the searching criterion is defined. At this point the clini-
cian has also to select which resources wants to use for searching literature and
provide weights if a different importance among resources are identified. Nowa-
days, the Cochrane library, PubMed , Fisterra and Ikere are accessible resources
from this implementation. The first two sources contain English literature while
the last two mentioned resources store Spanish publications.
Based on the defined parametrization, the system searches for literature, and
presents the results to the clinician. Previously launched searching results can be
accessed, since the system maintains a cache of this information —not only the
retrieved literature, but also the notes and the valuation the clinician has done for
each publication. When previous searching results are consulted, this information
is also presented to the clinician. The interface permits to search collection, not
only using free text, but also exploiting the annotated terminology.
Currently we are working on integrating the clustered results visualization
which is already running as service, but it is not exploited in the interface.

5
http://www.tratamiento20.es/inicio.html
6
http://www.i3b.ibermatica.com/i3b/noticias/2009/
osi-el-hospital-extendido

24
Clinical
Clinical Literature Related LiteratureHealth
To Electronic Related To Electronic Health Records
Records 7

5 Conclusions and Future Lines

In this paper we have presented a method to integrate biomedical literature
repositories with patient EHRs. This method provides a platform for retrieving
literature about a certain patient practice with minimal effort. The PICO query
generation is supported, automatically extracting characteristics from patient
EHR and applying NLP techniques for relevant terminology identification. Fed-
erated search strategy is used to access simultaneously to different information
sources and to present a unified collection. With this method clinicians can edit
the resultant collection, to record their considerations and to evaluate the rel-
evance of each publication. All the information would be available afterwards
to be accessed by the user without repeating the entire process. This method is
also being standardized by using EHR formats such as HL7-CDA instead of a
proprietary platform format.
Nowadays only access is provided to all the information generated by the
users regarding the publications —i.e. agreement about the publications’ rel-
evance, tags and notes. But analysing that information, user profiles can be
generated to facilitate model and adjustment of the results based on users’ pref-
erences. So, the method is intended to be the foundation of further development
of a collaborative environments where clinicians could share their searches, notes
and all the knowledge generated around a search. It would became a new re-
source of know-how that combines scientific studies with the daily experience of
clinicians themselves.

References
1. Jimenez-Castellanos, A., Fernandez, I., Perez, D., Viejo, E., Dı́ez, F.J., Garcı́a de
Kortazar, X., Garcı́a, M., Maojo, V., Cobo, A: Patient based literature retrieval
and integration: A use case for diabetes and arterial hypertension. Proceedings of
the International Conference on Health Informatics, 33–41 (2011)
2. Meystre, S.M., Savova, G.K., Kipper-Schuler, K.C., Hurdle, J.E., Extracting infor-
mation from textual documents in the electronic health record: a review of recent
research. IMIA Year-book of Medical Informatics, Methods Inf Med 2008, 47 Suppl
1, 138–154 (2008)
3. Eichelberg, M., Aden, T., Riesmeier, J., Dogac, A., Laleci, G. B.: Electronic health
record standards A brief overview. Information & Communications Technology
(ICICT06) ITI4 (2006)
4. Dolin, R.H., Alschuler, L., Boyer, S., Beebe, C., Behlen, F.M., Biron, P.V., Shabo
Shvo, A.: HL7 Clinical Document Architecture, Release 2. J. Am. Med. Inform.
Assoc., vol. 13(1), 30–39 (2006)
5. Kalra, D., Beale, T., Heard, S.: The openEHR Foundation. Regional Health
Economies and ICT Services: The PICNIC Experience, 115/2005, 153-173 (2005)
6. Lee, K. F.: Automatic speech recognition: The development of the SPHINX system.
Kluwer Academic Pub (1989)
7. Liu, K., Mitchell, K.J., Chapman, W.W., Crowley, R.S.: Automating tissue bank
annotation from pathology reports - comparison to a gold standard expert anno-
tation set. Proceedings of the AMIA Annu Symp, 460-464 (2005)

25
8
Clinical Fernandez
Literature et. al.To Electronic Health Records
Related

8. Tsuruoka, Y., Tsujii, J., Ananiadou S.: FACTA: a text search engine for finding
associated biomedical concepts. Bioinformatics vol. 24(21), 2559-2560 (2008)
9. Kim, J.J., Pezik, P., Rebholz-Schuhmann, D.: Medevi: Retrieving textual evidence
of relations between biomedical concepts from medline. Bioinformatics, 24(11),
1410-1412 (2008)
10. Jacso, P.: Thoughts about federated searching. Information Today, 21(9), 17–20
(2004)
11. Schardt, C., Adams, M. B., Owens, T., Keitz, S., Fontelo, P.: Utilization of the
PICO framework to improve searching PubMed for clinical questions. BMC Med
Inform Decis Mak, 7–16 (2007)
12. Meats, E., Brassey, J., Heneghan, C., Glasziou, P.: Using the Turning Research
Into Practice (TRIP) database: how do clinicians really search?. J. Med Libr Assoc
156-163 (2007)
13. Huang, X., Lin, J., Demner-Fushman, D.: Evaluation of PICO as a knowledge
representation for clinical questions. AMIA Annual Symposium proceedings, 359–
363 (2006)