Healthcare Data Summarization via Medical Entity
                                Recognition and Generative AI
                                Giuseppe Riccio1,2 , Antonio Romano1,2 , Andriy Korsun1,2 , Michele Cirillo1,2 ,
                                Marco Postiglione1,2 , Valerio La Gatta1,2 , Antonino Ferraro1,2 , Antonio Galli1,2 and
                                Vincenzo Moscato1,2
                                1
                                    University of Naples ”Federico II”, Via Claudio 21, Naples, 80125, Italy
                                2
                                    BIG DATA CINI National LAB - Node University of Naples ”Federico II”, Naples, 80125, Italy


                                                                         Abstract
                                                                         This paper presents a fully automated approach for extracting value from content that lies hidden in
                                                                         Electronic Health Records (EHRs) using Large Language Models (LLMs) and Natural Language Processing
                                                                         (NLP) techniques, such as Named Entity Recognition (NER) and Entity Linking (L). In particular, the
                                                                         state-of-the-art approaches used to solve this task suffer from problems related to poor automation,
                                                                         given the laborious process of fine-tuning the models used and the difficult interpretation of the results
                                                                         obtained from them. The solution proposed in this work, on the other hand, aims to show the potential
                                                                         of NLP and generative AI to extract the relevant medical concepts contained within EHRs and generate a
                                                                         summary of the entire clinical history of each patient to construct a simple and intuitive dashboard that
                                                                         supports medical personnel with relevant medical information and useful analytics in order to diagnose
                                                                         and make decisions regarding the clinical condition of a patient.

                                                                         Keywords
                                                                         Named Entity Recognition, Entity Linking, Relation Extraction, Summarization, Generative AI


                                1. Introduction
                                The exponential increase of complex aggregated data in the healthcare sector and beyond has
                                made understanding such data by medical professionals a challenging task. Extracting relevant
                                information from Electronic Health Records [1], such as clinical notes, represents a significant
                                challenge that requires advanced solutions based on the potential of Big Data and Natural
                                Language Processing (NLP). In this article, we review the current approaches proposed in
                                the scientific literature and present a possible solution that takes advantage of Named Entity


                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
                                Envelope-Open giuseppe.riccio9@studenti.unina.it (G. Riccio); antonio.romano45@studenti.unina.it (A. Romano);
                                a.korsun@studenti.unina.it (A. Korsun); michele.cirillo2@studenti.unina.it (M. Cirillo); marco.postiglione@unina.it
                                (M. Postiglione); valerio.lagatta@unina.it (V. L. Gatta); antonino.ferraro@unina.it (A. Ferraro);
                                antonio.galli@unina.it (A. Galli); vmoscato@unina.it (V. Moscato)
                                GLOBE https://github.com/giuseppericcio (G. Riccio); https://github.com/LaErre9 (A. Romano);
                                https://github.com/andriykorsun (A. Korsun)
                                Orcid 0009-0002-8613-1126 (G. Riccio); 0009-0000-5377-5051 (A. Romano); 0000-0003-1470-8053 (M. Postiglione);
                                0000-0002-5941-4684 (V. L. Gatta); 0000-0002-1326-0325 (A. Ferraro); 0000-0001-9911-1517 (A. Galli);
                                0000-0002-0754-7696 (V. Moscato)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                                                                           1


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
                        Healthcare Data Summarization via Medical Entity Recognition and Generative AI


Recognition (NER), Entity Linking (L), Relation Extraction (RE) and text synthesis techniques
such as Large Languages Models (LLMs).
   In the scientific literature, several approaches have been proposed for the automatic extraction
and synthesis of information from clinical records. One of the main approaches used is Named
Entity Recognition (NER) [2], that focuses on the identification and extraction of relevant
entities, which in this case could include diseases, drugs, medical procedures and symptoms
within clinical texts. The use of clustering techniques [3] is common to organise and categorise
extracted data, e.g. by applying clustering algorithms to group together notes dealing with
similar topics. Furthermore, text synthesis techniques are used through the use of Generative
Language Models [4], which enable the generation of coherent and contextually appropriate
summaries based on the extracted data, for example, reports on a patient’s medical history,
providing a clear and concise picture of medical conditions, prescribed drugs and relevant
procedures performed.
   This paper presents a summary of the final project conducted as part of the Big Data Engi-
neering course offered in the Master’s Degree program in Computer Science at the University
of Naples Federico II. The project aimed to address the challenges of the Big Data domain, with
a specific focus on data management, processing, analysis, report generation, and protection.
Our proposal is based on the combination of advanced techniques previously discussed, such as
NER and LLMs.
   To evaluate the effectiveness of our solution, we employed the MIMIC III dataset as our data
source. This dataset is widely recognized for its extensive coverage, representativeness, and
richness of clinical information. Through the application of our approach to this dataset, we
demonstrate the information extraction and synthesis process, presenting the obtained results
and their validity in the clinical context.
   With this article, we fill a significant gap in the scientific literature by making a relevant
contribution to the development of advanced approaches for the automatic extraction and
synthesis of concepts from medical records, exploiting the potential of Big Data and LLMs to
provide a comprehensive view of a patient’s medical condition over time.
   The results obtained from this study are valuable both for the academic community and the
industry, as they offer a solid foundation for further research and advancements in the field of
clinical data management and data engineering.


2. Related Work
In other fields than biomedical, several annotation interfaces have been developed for popular
Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), Entity
Linking (EL), Relation Extraction (RE), Entity Normalization, Dependency Parsing, Chunking
and so on. Among the available options, open-source tools such as BRAT (Stenetorp et al.,
2012 [5]) have gained popularity. BRAT not only facilitates the management, monitoring and
collection of annotated document corpuses, but also supports general annotation tasks. Another
tool, Prodigy1 , is a commercial product that offers a modern annotation method for creating
training and evaluation data for machine learning models. Although this tool can use various
1
    Documentation available at this site: https://prodi.gy/docs


                                                            2
                  Healthcare Data Summarization via Medical Entity Recognition and Generative AI


models to suggest entities, being based on the well-known NLP SpaCy library, it lacks automated
integration with existing biomedical NER+L systems. Regarding biomedical NER+L, previous
scientific research has introduced tools such as MetaMAP (Aronson, 2001 [6]) and CTakes
(Savova et al., 2010 [7]). These tools allow users to inspect recognized entities but do not provide
mechanisms to correct and refine concepts or specify additional annotations based on specific
research areas. Another tool called SemEHR (Wu et al., 2018 [8]) focuses on biomedical NER+L
but differs in its approach from previous tools. Indeed, SemEHR allows for the incorporation of
customized preprocessing and postprocessing steps and supports research-specific use cases.
However, it does not directly improve the NER+L model through an interface, but treats the
provided NER+L model as a black-box model, with no possibility of changing the recognized
entities or obtaining more in-depth meta-information on those returned.
   Regarding the summarization task, there are two main approaches in the literature: the
first, called extractive summarization, involves the generated summary being composed of
sentences extracted from the text provided as input based on a metric of importance of those
sentences in the context of the text. The second approach, called abstractive summarization,
involves extracting words within the text and reprocessing them to compose semantically
related sentences. Regarding extractive summarization, several solutions have been proposed
involving the use of neural networks, in which the problem is formulated as a classification
task and networks composed of encoders and decoders are used (Cheng and Lapata, 2016 [9];
Nallapati et al., 2016 [10]) or pre-trained language models (Egonmwan and Chali, 2019 [11]; Liu
and Lapata, 2019 [12]). In contrast, the state of the art in abstractive summarization involves
numerous approaches, the most popular of which is the use of pre-trained encoder-decoder
transformer models on a masked pre-training input target, the most popular of which are MASS
(Song et al., 2019 [13]), UniLM (Dong et al., 2019 [14]), and BART (Lewis et al., 2019 [15]).
   However, the approaches just mentioned require manual work to classify each document
and the concepts it contains, and it is impractical to manually annotate large datasets such as
those of patient notes. In this paper we try to explore, therefore, not only the development of a
simple interface for annotating clinical texts, through NER+L models but also the generation
of complete summaries of a patient’s clinical history using LLMs, which allows to provide
these summaries starting from the clinical notes treated in an appropriate way in a rapid and
completely automated way with no need to fine tune a model as required by other approaches.
Furthermore, through the integration of the entities extracted from the NER+L+RE and with
the patient’s summary, it is possible to provide, through a simple and intuitive dashboard, a
series of analytics that support the diagnoses and decisions to be made with respect to an ill
patient by competent medical personnel.


3. Methodological Workflow
Within our study, therefore, our goal was to develop a solution for the automatic summarization
and report generation of clinical notes using, as previously written, a combination of Named
Entity Recognition (NER), Linking (L), Relation Extraction (RE) and Language Models (LLMs).
In order to achieve this, we followed a detailed workflow as follows: (Figure 1)


                                                 3
                 Healthcare Data Summarization via Medical Entity Recognition and Generative AI


Figure 1: Workflow of the proposed method


                                              4
                  Healthcare Data Summarization via Medical Entity Recognition and Generative AI


3.1. Data collection
We used the anonymised clinical database provided by MIMIC-III [16] [17] [18]. In particular,
we extracted information from the tables ”PATIENTS” and ”NOTEEVENTS”. This data collection
phase forms the basis of our study, although further processing is necessary to improve the
data quality.

3.2. Data extraction
We conducted a data analysis to understand the distribution and composition of the clinical
notes. This analysis helped us decide which fields to extract. During the data extraction process,
we applied additional filters according to the limitations imposed by the selected data storage.
In particular, we set a maximum limit of 10,000 tokens for the sum of the documents. We
performed additional filtering to select at least 2 documents per patient and removed duplicate
rows. Moreover, we extracted the demographic information of the patients from the ”PATIENTS”
table and merged it with the clinical notes dataframe to obtain a final dataframe upon which to
perform subsequent project operations. The final data was stored in the selected data storage.

3.3. Data preprocessing
We selected a sample of the previously extracted data in order to make the further processes
more efficient. Next, we applied the Data Preprocessing step to the sampled clinical notes. This
operation included the extraction of relevant sections from the clinical notes (Section Extraction)
and the application of textual preprocessing techniques, such as stopwords elimination and
lemmatization, to the text of the clinical notes.

3.4. Summarization
To generate the summaries of the clinical notes, we used a model based on Large Language
Models (LLMs). The input for the model was structured according to the following prompt:

    • System: e.g for summary: ”You are a formal medical assistant specialising in the summary
      of a patient’s clinical notes” while for clinical trend: ”You are a clinical trend extractor of
      a patient ’s clinical notes.”
    • User: an experimental approach:

                                              𝑃 =𝐸+𝑇 +𝑆                                           (1)

      Where:
         – P: Prompt;
         – E: Explanation: explanation of what is wanted from the model;
         – T: Text: text that the model must deal with and summarise, in this case the clinical
           notes;
         – S: Show the type of the task: demonstration of an example of output useful for the
           model;


                                                 5
                                 Healthcare Data Summarization via Medical Entity Recognition and Generative AI


     Explanation

  Generate a complete but concise (max 100 words) and informative summary that
  focuses only on the unique patient, that is always the same throughout the notes in
  input, and his medical history, current condition, and relevant details, starting from
  his current conditions backwards; The summary must be accurate and avoid
  unnecessary repetition, avoid details of prescriptions, doses and other subjects
  besides the specific patient; If there are dates, enter the year, month and the day; Do
  not include doctors' names; It must be easy for a doctor to understand, the tone is
                                                                                                   Explanation
  formal; The summary always starts with "The patient is a:" ".
                                                                                                Identify in 1 word the clinical trend like this: extract a single word that accurately

          Text                                                                                  represents the clinical trend observed in the patient's notes, considering every detail

  here goes the sequence of clinical notes representing the medical history                     and keyword; For the word, use the general indicators: 'Improvement', 'Stable',
                                                                                                'Worsening' to describe the trend; Otherwise identify in 1 word: "Dead" if the
         Show                                                                                   patient from his notes is dead.

  The patient currently is a age-year-old gender who presented with chief complaint
  and has a medical history of relevant medical conditions. The patient was involved                    Text

  in incident/accident description, which led to specific injuries/traumas; The                 here goes the sequence of clinical notes representing the medical history

  postoperative course involved relevant procedures performed on anatomical
                                                                                                       Show
  locations involved; The patient is currently prescribed medications for specific
  purposes;                                                                                     Improvement


                                  (a) Summary                                                                                (b) Clinical trend

Figure 2: Examples of summarization prompts.


We applied the same structure for both the general summary of the clinical notes (Summary)
and the identification of clinical trends (Clinical Trend).
   Summary: the aim of the summary is to provide an overview of the patient’s condition,
treatments carried out, main diagnoses or other key points in the clinical notes. Figure 2(a)
shows an example of summarization prompts.
   Clinical trend: refers to patterns or changes observed in a patient’s clinical picture over
time. Figure 2(b) shows an example of prompt used to infer the medical history trend.

3.5. Named Entity Recognition + Linking + Relation Extraction
From the preprocessed data, we applied the Named Entity Recognition (NER) and Entity Linking
(L) process to extract medical entities from the clinical notes and link them to external knowledge
databases. In addition, we performed Relation Extraction (RE) to identify possible relationships
between entities. This information was used to create a Knowledge Base for graph analysis, in
which diagnostic analyses were performed.

3.6. Analytics and Report
We summarized the results obtained from the clinical note summary process and the Named
Entity Recognition + Entity Linking + Relation Extraction process. These results were presented
in a final report documenting the main results obtained within our study.


                                                                                            6
                     Healthcare Data Summarization via Medical Entity Recognition and Generative AI


4. Results
4.1. Implementation details
The techniques seen above are applied by randomly selecting 20 patients from the MIMIC-
III dataset [17]; in particular, only the ”Discharge summary” of patients selected from the
NOTEEVENTS table is considered2 .
   Then, through the MedSpaCy [19] module, the extraction of the main sections, i.e., those
most present, from all patients’ clinical notes is performed. MedSpaCy was chosen because that
library is already pre-trained to recognize sections present in clinical texts. Thus, for the case
under consideration, the following sections are selected: ”chief complaint,” ”history of present
illness,” ”past medical history,” ”discharge medications,” ”brief hospital course” and ”discharge
diagnoses”. In the clinical notes with the extracted sections, a following stage of stopwords
elimination and lemmatization is carried out through the NLTK3 module using stopwords of
the English language, the clinical notes being in that language, and WordNet as a lemmatizer.
   To perform the summarization task we chose to use as LLM the GPT-3.5-turbo 4 model
provided by OpenAI, this model, in fact, is the one that starting from a prompt and the clinical
notes of the incoming patient manages to return a short, but at the same time complete, summary
containing all the main information regarding the patient’s medical history. In addition to the
patient summary, the patient’s clinical trend is also generated using the same model, which,
based on the evolution of his clinical history shown in the notes, provides an indication of the
patient’s current status.
   For the NER task it was decided to use the MedCAT [20] model, this model was found
to be the one with the best performance for the requested task being pre-trained for natural
language processing in the medical field. In particular, it is very useful for extracting information
from Electronic Health Records (EHR) (NER phase) and linking them to biomedical ontologies
such as SNOMED-CT and UMLS (Entity Linking phase). Since MedCAT is only a model that
correctly extracts and labels entities, it is necessary to provide MedCAT with a knowledge
base from which to draw this information. For the case in question, it was decided to use the
MedMentions [21] library, which contains a corpus of biomedical documents annotated with
mentions of entities belonging to UMLS. This corpus contains about 35,000 medical concepts
and its MetaCAT model for meta annotations was built on a sample from MIMIC-III. An example
of a clinical note annotated with MedCAT using MedMentions is shown in Figure 3.
   Finally, the concepts extracted from the NER+L phase must be interconnected through the
relationships extracted with the Relation Extraction phase, also in this case these relations
are obtained using the GPT-3.5-turbo model. Given their extremely complex nature, these
relationships are particularly suitable to be stored via a graph database such as Neo4J5 .


2
  The complete code used to carry out the experiments reported in the following article is available in the following
  GitHub repository: https://github.com/giuseppericcio/HealthcareSummarizationMIMIC
3
  Documentation available at the site: https://www.nltk.org/
4
  More details on models provided by OpenAI: https://platform.openai.com/docs/models/gpt-3-5
5
  Documentation available at the site: https://neo4j.com/docs/


                                                         7
                         Healthcare Data Summarization via Medical Entity Recognition and Generative AI


Figure 3: Output example of the NER step on a clinical note (patient 27431).


                        (a) Patient 27431                                (b) Patient 30822

Figure 4: Clinical Trend and Summary generated for several patients


4.2. Dashboard and Analytics
Thanks to the entities extracted from the NER+L+RE phase and the summaries generated by the
LLM, it is possible to build a dashboard that supports the medical personnel during the diagnosis
and the choice of therapies to be carried out on a patient in order to cure his diseases. In this
instance, it was decided to use the Streamlit6 library for the construction of the dashboard
directly in Python. This choice is dictated by the simplicity of creating a dashboard using
this library which allows various effective views of the analytics that can be performed on
biomedical concepts recognized by NER+L+RE. In Figure 4 is possible to see some examples of
summaries generated from some clinical notes of MIMIC-III patients, through these summaries
a physician can understand all the diseases and procedures with respect to that patient in a fast
way than reading all of his clinical notes. From Figure 4(b), a hypothetical flaw in our workflow
emerges concerning the use of male gender instead of female in the summary. This discrepancy
is probably due to errors resulting from possible hallucinations of the LLMs model. We propose
to resolve this issue by extending the previously used prompt or by adopting a more advanced
model such as GPT 4.

6
    Documentation available at the site: https://docs.streamlit.io/


                                                             8
                  Healthcare Data Summarization via Medical Entity Recognition and Generative AI


4.2.1. Analytic 1: Lists of concepts extracted
First of all, through the dashboard, it is possible to view lists of drugs, symptoms and diagnostic
procedures related to a specific patient, as shown in Figure 5. Through this visualization, medical
personnel can immediately understand what the patient has been subjected to without having
to read all his medical records.


Figure 5: Lists of concepts extracted by NER+L for the patient 27431


4.2.2. Analytic 2: Diseases of a patient
Using the relations extracted from the NER+L+RE phase is possible to visualize some inter-
esting analytics. Through the concepts stored in Neo4J, which has also been integrated into
Streamlit via the py2neo and streamlit-agraph libraries. As shown in Figure 6(a), it is possible to
effectively display all the diseases associated with a patient. For example, the patient taken into
consideration presented ”Hypoxia” in all the medical records associated with him, therefore, it
could be deduced that he suffered from it chronically.

4.2.3. Analytic 3: Medical concepts related to a disease of a patient
With reference to the previous analytic, we can now explore all the diagnostic procedures,
drugs and other treatments performed on the patient to cure the ”Hypoxia” disease, as shown
in Figure 6(b), in order to facilitate physicians and nurses understand what has already been
done and what needs to be done later on to that patient.


5. Discussion
Our solution offers numerous benefits, such as automating the clinical note synthesis process,
improving productivity and reducing analysis time for healthcare professionals. Through the
use of advanced techniques such as NER+L+RE, we are able to identify medical entities and
the relationships between them, providing a solid basis for analysing and interpreting clinical
information.
   However, the automatic extraction and synthesis of information from clinical records presents
significant and sensitive challenges. The management of sensitive data, in compliance with
regulations such as the General Data Protection Regulation (GDPR), the Health Insurance


                                                 9
                   Healthcare Data Summarization via Medical Entity Recognition and Generative AI


                 (a) Diseases of the patient           (b) Procedures, drugs and so on related to a
                                                       specific disease

Figure 6: Graph Analytics for the patient 27431


Portability and Accountability Act (HIPAA), and the Artificial Intelligence Act, requires special
attention to privacy and the protection of sensitive personal data. Therefore, it is essential to
ensure security and compliance with guidelines regarding data access, storage and disposal.
   To address these ethical and regulatory issues, robust security measures must be implemented
to protect patients’ personal data and ensure transparency in the use of clinical information.
Furthermore, it is crucial to inform patients about the processing of their data and the generative
nature of the results obtained, emphasising that the system does not replace the work of doctors.
Adherence to ethical standards is of paramount importance to maintain patients’ trust and to
ensure responsible and safe use of clinical information. For example, the automatic generation
of summaries raises ethical issues regarding the accurate interpretation of clinical information,
so it is necessary to ensure the reliability of the generated results and assess the quality of the
summaries through further research and validation.
   Another disadvantage comes from the tools and limitations of the models used. The specialised
medical language used in clinical registries, with abbreviations and technical terminology,
requires the use of additional resources, such as medical dictionaries or abbreviation recognition
systems, in order to overcome the challenges of information interpretation and extraction.
Furthermore, the size of clinical note datasets requires the use of efficient text processing models
in order to handle large volumes of data.
   Finally, our work represents a step towards automation and optimization of clinical note
analysis, but further studies and collaborations are needed to improve the accuracy, reliability
and adherence to ethical standards of our solutions.


6. Conclusion and Future Work
Our work has developed a solution for the automated summarization of clinical notes using
NER+L+RE and LLM techniques, providing fast decision support for healthcare professionals


                                                  10
                  Healthcare Data Summarization via Medical Entity Recognition and Generative AI


towards patients. The preliminary results obtained in this paper were submitted to a committee
of domain experts for review, who upon initial analysis evaluated the work positively. However,
there are many possibilities for further development and future work arising from this project.
Some ideas include:

    • Predictive analytics: Expanding our solution to include predictive analytics models that
      can provide estimates of patients’ future conditions, such as the likelihood of developing
      certain diseases or response to certain treatments;
    • Patient profiling: Create a comprehensive overview of all patients treated, allowing
      clinicians to identify high-risk patients. This would require unsupervised data analysis,
      such as using clustering algorithms to group patients according to common characteristics;
    • Interactivity: Increased interactivity through human body diagrams that display diseased
      or clinically affected body parts with a brief summary of the problem;
    • Integration: Expand our solution to be easily integrated with databases from different
      hospitals, allowing healthcare professionals to use the system with their own data;
    • Interpretability: Improve the transparency and interpretability of the system by pro-
      viding clear explanations of the forecasts and recommendations generated. This would
      help physicians understand the reasons behind the results and have confidence in the
      information provided.
    • Q/A (Question/Answer): Implement a Q/A interface that allows doctors and patients to
      interact directly with the system, asking specific questions and getting precise answers
      based on the data in the database.

   In conclusion, the future goal is to continue to develop solutions that improve the efficiency
(currently the proposed solution on 20 patients takes an average time of 298 seconds) and
accuracy of clinical note analysis through systematic and more formal approaches.


References
 [1] D. Charles, J. King, V. Patel, M. Furukawa, Adoption of electronic health record systems
     among u.s. non-federal acute care hospitals, ONC Data Brief No. 9 (2013) 1–9.
 [2] D. Soomro, S. Banbhrani, A. Shaikh, H. Raj, Bio-ner: Biomedical named entity recognition
     using rule-based and statistical learners, 2017.
 [3] T. Loftus, B. Shickel, J. Balch, P. Tighe, K. Abbott, B. Fazzone, E. Anderson, J. Rozowsky,
     T. Ozrazgat Baslanti, Y. Ren, S. Berceli, W. Hogan, P. Efron, J. Moorman, P. Rashidi,
     G. Upchurch, A. Bihorac, Phenotype clustering in health care: A narrative review for
     clinicians, Frontiers in Artificial Intelligence 5 (2022) 842306. doi:10.3389/frai.2022.
     842306 .
 [4] M. K. Rohil, V. Magotra, An exploratory study of automatic text summarization in biomed-
     ical and healthcare domain, Healthcare Analytics 2 (2022) 100058. doi:10.1016/j.health.
     2022.100058 .
 [5] P. Stenetorp, S. Pyysalo, G. Topic, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based
     tool for nlp-assisted text annotation, The 3th Conference of the European Chapter of the
     Association for Computational Linguistics; Avignon, France (2012) 102–107.


                                               11
                  Healthcare Data Summarization via Medical Entity Recognition and Generative AI


 [6] A. Aronson, Effective mapping of biomedical text to the umls metathesaurus: The metamap
     program, Proceedings / AMIA ... Annual Symposium. AMIA Symposium 2001 (2001) 17–21.
 [7] G. Savova, J. Masanz, P. Ogren, J. Zheng, S. Sohn, K. Kipper-Schuler, C. Chute, Mayo
     clinical text analysis and knowledge extraction system (ctakes): Architecture, component
     evaluation and applications, Journal of the American Medical Informatics Association :
     JAMIA 17 (2010) 507–13. doi:10.1136/jamia.2009.001560 .
 [8] H. Wu, G. Toti, K. Morley, Z. Ibrahim, A. Folarin, R. Jackson, I. Kartoglu, A. Agrawal,
     C. Stringer, D. Gale, G. Gorrell, A. Roberts, M. Broadbent, R. Stewart, R. Dobson, Semehr:
     A general-purpose semantic search system to surface semantic data from clinical notes
     for tailored care, trial recruitment, and clinical research, Journal of the American Medical
     Informatics Association 25 (2018) 160. doi:10.1093/jamia/ocx160 .
 [9] J. Cheng, M. Lapata, Neural summarization by extracting sentences and words, 2016.
     doi:10.18653/v1/P16- 1046 .
[10] R. Nallapati, F. Zhai, B. Zhou, Summarunner: A recurrent neural network based sequence
     model for extractive summarization of documents, Proceedings of the AAAI Conference
     on Artificial Intelligence 31 (2016). doi:10.1609/aaai.v31i1.10958 .
[11] E. Egonmwan, Y. Chali, Transformer-based model for single documents neural summa-
     rization, 2019. doi:10.18653/v1/D19- 5607 .
[12] Y. Liu, M. Lapata, Text summarization with pretrained encoders, 2019.
[13] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, Mass: Masked sequence to sequence pre-training
     for language generation, 2019. arXiv:1905.02450 .
[14] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H.-W. Hon, Unified
     language model pre-training for natural language understanding and generation, 2019.
     arXiv:1905.03197 .
[15] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle-
     moyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation,
     translation, and comprehension, 2019. arXiv:1910.13461 .
[16] A. Johnson, T. Pollard, M. Roger, ”mimic-iii clinical database” (version 1.4), 2016. doi:10.
     13026/C2XW26 .
[17] A. Johnson, T. Pollard, L. Shen, L.-w. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits,
     L. Celi, R. Mark, Mimic-iii, a freely accessible critical care database, Scientific Data 3 (2016)
     160035. doi:10.1038/sdata.2016.35 .
[18] A. Goldberger, L. Amaral, L. Glass, S. Havlin, J. Hausdorg, P. Ivanov, R. Mark, J. Mietus,
     G. Moody, C.-K. Peng, H. Stanley, P. Physiobank, Components of a new research resource
     for complex physiologic signals, PhysioNet 101 (2000).
[19] H. Eyre, A. Chapman, K. Peterson, J. Shi, P. Alba, M. Jones, T. Box, S. DuVall, O. Patterson,
     Launching into clinical space with medspacy: a new clinical text processing toolkit in
     python, AMIA ... Annual Symposium proceedings. AMIA Symposium 2021 (2022) 438–447.
[20] Z. Kraljevic, D. Bean, A. Mascio, L. Roguski, A. Folarin, A. Roberts, R. Bendayan, R. Dobson,
     Medcat – medical concept annotation tool, 2019. arXiv:1912.10166 .
[21] S. Mohan, D. Li, Medmentions: A large biomedical corpus annotated with umls concepts,
     2019.


                                                 12