<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Knowledge-Based Systems</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sanivert: Transformers-based End2End System for Speech Recognition in Spanish, Catalan and Portuguese in Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro José Vivancos-Vicente</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Salvador Castejón-Garrido</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ronghao Pan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilo Caparrós-Laiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Antonio García-Díaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Valencia-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informática y Sistemas, Universidad de Murcia, Campus de Espinardo</institution>
          ,
          <addr-line>30100 Murcia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>VÓCALI SISTEMAS INTELIGENTES S.L. Parque Científico de Murcia, Carretera de Madrid km 388. Complejo de Espinardo</institution>
          ,
          <addr-line>30100 Murcia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>56</volume>
      <issue>2014</issue>
      <fpage>19</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This project focuses on improving VÓCALI's Automatic Speech Recognition (ASR) systems, especially in the healthcare domain. Despite the impressive performance of state-of-the-art speech recognition models such as Whisper, the lack of healthcare-specific training data hinders their effectiveness in this crucial domain. In addition, most ASR systems produce unpunctuated output, which reduces readability, especially in scenarios where interpretation is ambiguous. To address these challenges, the Sanivert project aims to adapt speech recognition models using end-to-end deep learning approaches and to develop post-processing systems for punctuation and capitalization restoration, which are crucial for improving the quality of speech recognition output. In addition, the project incorporates information extraction techniques such as named entity recognition and relation extraction to facilitate the extraction of clinical knowledge from dictated reports. With a focus on Spanish, Catalan and Portuguese, the project aligns with VÓCALI's existing solutions while strategically expanding into new markets. Ultimately, VÓCALI aims to create more adaptable and accurate ASR systems tailored to different languages and clinical specialties, ensuring improved performance in healthcare and other domains.</p>
      </abstract>
      <kwd-group>
        <kwd>ASR</kwd>
        <kwd>punctuation restoration</kwd>
        <kwd>knowledge extraction</kwd>
        <kwd>natural language processing</kwd>
        <kwd>medical domain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and main objective</title>
      <p>This project is funded by the Spanish Government and
the Digital Transformation Ministry and by the
European Union - NextGenerationEU under the “Plan de
Recuperación, Transformación y Resiliencia”, under the
2021 call of research projects in Artificial Intelligence
and other digital technologies and their integration in
value chains.</p>
      <p>Currently, VÓCALI specializes in the development
of Natural Language Processing (NLP) and Automatic
Speech Recognition (ASR) systems in various domains
such as home automation, robotics, telephony, or
public administration, with healthcare being the company’s
largest market niche.</p>
      <p>With the rapid development of Transformers and
pretrained models, ASR models such as Conformer [1],
Wav2Vec 2.0 [2], HuBERT [3] or Whisper [4] among
others have demonstrated very good performance due
to their transfer learning capabilities, which use prior
knowledge learned during the pre-training phase and
transfer this knowledge to specific tasks with relatively
little additional training data.</p>
      <p>Among the aforementioned models, Whisper [4] stands out because it has demonstrated human-level performance. It is pretrained for ASR on 680k hours of weakly supervised data and generalizes with greater robustness and about 50% fewer errors in zero-shot settings, i.e. when transcribing speech without having been specifically trained on the target task and without needing explicit training examples for it. However, due to data protection and patient privacy issues, its training set does not include audio from the healthcare sector, and its results in this sector are not good enough for commercial purposes. Another limitation of most ASR systems is that they produce unpunctuated output, which significantly reduces readability and overall comprehension, especially in scenarios where interpretation is ambiguous. Therefore, the restoration of punctuation and capitalization is one of the most important post-processing tasks in ASR systems [5].</p>
      <p>On the other hand, in the medical field, physicians face the challenge of managing large volumes of narrative documents containing critical patient information. Information extraction techniques such as named entity recognition and relation extraction play a key role in structuring and extracting valuable insights from clinical reports [6].</p>
      <p>The main objective of Sanivert is to adapt and improve VÓCALI’s ASR models by incorporating end-to-end (E2E) Deep Learning technologies, so that the systems and models generated do not depend on the isolated development of acoustic, phonological and linguistic models, as is the case with current systems. This greatly facilitates the adaptation of current solutions to new languages and new clinical specialties. In addition, punctuation and capitalization restoration systems have been developed, as well as systems for extracting clinical knowledge from dictated reports.</p>
      <p>This main objective is divided into five objectives:</p>
      <list list-type="bullet">
        <list-item><p>(OB1) Generation of linguistic resources by medical specialty and language.</p></list-item>
        <list-item><p>(OB2) Generation of E2E models for ASR based on Transformers.</p></list-item>
        <list-item><p>(OB3) Development of a punctuation restoration system.</p></list-item>
        <list-item><p>(OB4) Development of a clinical knowledge extraction system.</p></list-item>
        <list-item><p>(OB5) Integration of the previous modules.</p></list-item>
      </list>
      <p>In this project, the technologies developed will be applied in three languages: Spanish, Catalan and Portuguese. Spanish was selected because it is the language on which most of VÓCALI’s solutions are based and is therefore the best language for validation. Catalan and Portuguese were selected as a strategic business decision to reach Catalonia, Portugal and Brazil.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System architecture</title>
      <p>The architecture of the proposed system is shown in Figure 1. A brief description of each of its components is given below.</p>
      <sec id="sec-2-1">
        <title>2.1. E2E ASR</title>
        <p>As mentioned above, current ASR systems perform very well in Spanish. In fact, Whisper [4] has a very low word error rate (WER) in Spanish, reaching 4.2% with its large-v2 version on the Multilingual LibriSpeech (MLS) dataset [7]. However, the WER increases significantly in domains such as medicine, which contain very specialized terms that the system does not recognize correctly. Table 1 shows examples of such words in Spanish, Catalan and Portuguese.</p>
        <p>To address these issues, fine-tuning was performed for each language using a proprietary dataset of audio recordings paired with their transcriptions, in order to adapt the Whisper versions to the medical domain. Other existing datasets of clinical reports were also used to expand the training set, such as CodiEsp [8], E3C [9], and MTSamples<sup>1</sup>. However, these datasets are not multimodal because they lack audio. For this reason, we used Text-to-Speech (TTS) models such as Coqui TTS<sup>2</sup> to transform the texts into audio with different real human voices. Coqui TTS includes tools for training new models and adapting existing models to any language. For Catalan, we used a model trained from scratch with three datasets: Festcat, OpenSLR69 and Common Voice v12. This model, based on Coqui TTS, is called projecte-aina/tts-ca-coqui-vits-multispeaker<sup>3</sup>.</p>
        <p>Currently, Whisper has five configurations of different sizes: tiny, base, small, medium and large. These models are being evaluated by fine-tuning them to determine which are best suited to the medical domain. In addition, other Transformers-based ASR systems have been tested with good results, with a view to moving these models into production.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Punctuation and capitalization restoration</title>
        <p>Punctuation restoration is a post-processing task in Natural Language Generation (NLG) that consists of adding capitalization and punctuation symbols to a text. Most ASR systems do not provide this information, which hinders language understanding and limits the performance of automatic text classification models.</p>
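        <p>To make the task concrete, the following minimal sketch simulates raw ASR-style output by removing punctuation and capitalization from a reference sentence. The function name <monospace>to_asr_style</monospace>, the set of punctuation marks and the example sentence are illustrative assumptions and are not taken from the Sanivert implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Punctuation marks to strip, including the Spanish opening marks (assumed set).
_PUNCT = re.compile(r"[.,;:!?¿¡()\"]")


def to_asr_style(text: str) -> str:
    """Return the text as an unpunctuated, lowercase word stream."""
    no_punct = _PUNCT.sub(" ", text)
    # Collapse the whitespace left behind by the removed marks.
    return " ".join(no_punct.lower().split())


# Hypothetical dictated sentence, used only for illustration.
reference = "¿Presenta fiebre? No, el paciente refiere dolor torácico."
asr_like = to_asr_style(reference)
print(asr_like)
```
]]></preformat>
        <p>In the stripped version the boundary between the question and the answer disappears, which is the kind of ambiguity that restoration is meant to resolve.</p>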
        <p>Nowadays, few models incorporate punctuation and capitalization restoration. Whisper [4], for example, incorporates both. To do this, Whisper processes the audio in the encoder, generating hidden states that the decoder then uses to generate the output tokens, so that the output is restricted to the transcribed text with punctuation added.</p>
        <p>Our punctuation and capitalization restoration approach is based on Transformers models used for sequence labeling in Spanish, Catalan and Portuguese [5, 10]. These models also restore capitalization, which improves the detection of medical entities. Punctuation and capitalization are restored by the same model.</p>
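        <p>The sequence labeling formulation can be sketched as follows: each token receives one punctuation label and one capitalization label, and the readable text is rebuilt from those labels. This is only an illustrative sketch. The label inventory (<monospace>PUNCT_LABELS</monospace>, <monospace>CAP</monospace>, <monospace>UPPER</monospace>, <monospace>LOWER</monospace>), the function name <monospace>restore</monospace> and the hand-written labels are assumptions; the label scheme used in [5, 10] may differ, and in a real system the labels would be predicted by the Transformers tagger.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

# Hypothetical label inventory: punctuation label -> symbol appended to the token.
PUNCT_LABELS = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def restore(tokens: List[str], labels: List[Tuple[str, str]]) -> str:
    """Rebuild readable text from lowercase tokens and (punct, case) labels."""
    if len(tokens) != len(labels):
        raise ValueError("one (punct, case) label pair is required per token")
    words = []
    for token, (punct, case) in zip(tokens, labels):
        if case == "CAP":      # capitalize only the first letter
            token = token[:1].upper() + token[1:]
        elif case == "UPPER":  # fully upper-case, e.g. acronyms
            token = token.upper()
        words.append(token + PUNCT_LABELS[punct])
    return " ".join(words)


# Labels written by hand here; a trained tagger would predict them.
tokens = ["el", "paciente", "presenta", "epoc", "se", "pauta", "salbutamol"]
labels = [("O", "CAP"), ("O", "LOWER"), ("O", "LOWER"), ("PERIOD", "UPPER"),
          ("O", "CAP"), ("O", "LOWER"), ("PERIOD", "LOWER")]
restored = restore(tokens, labels)
print(restored)
```
]]></preformat>
        <p>Predicting both labels per token is one way for a single model to handle punctuation and capitalization together. The sketch only appends marks after a token, so opening marks such as the Spanish inverted question mark would need an extra label.</p>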
        <p>To train the punctuation and capitalization restoration model, we rely on OpusParaCrawl [11], which contains data in Spanish, Catalan and Portuguese. This dataset contains a total of 17.2 million sentences and 774.58 million words written in Spanish. For Portuguese, we use the “pt-en” partition, which consists of 84.9 million sentences and 2.74 billion words written in Portuguese.</p>
        <p>Two extensions have been made to this dataset. First, we include an augmentation for Catalan based on the work described in [12], in which the authors consider replacing words with an unknown token, random insertion and random deletion. We use this method, but with a back-translation technique that consists of first translating the text into a specific language and then translating it back into its original language. When translating across different languages, translation models often replace some words with synonyms or generate new phrases with a similar meaning. Second, since this corpus does not contain texts related to the health sector, we have expanded it with texts from VÓCALI’s electronic clinical reports, with the aim of adapting and improving the results in the health sector.</p>
        <p>To train these models, we first cleaned the dataset and divided it into a set of tokens. We then made a custom split of the dataset into training, evaluation and test sets and, finally, trained the Transformers models using the sequence tagging approach.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Knowledge extraction</title>
        <p>The knowledge extraction system is capable of extracting and annotating natural language text with medical concepts, based on the use of ontologies and of entity recognition and semantic annotation technologies. This module is based on previous work of the research group [13, 14, 15]. First, medical terms are recognized using ontologies, for which we have compiled several lists of medical concepts and related them using an ontology that provides the structure from which new knowledge can be inferred. In addition, medical entities and other data are recognized in the reports. Finally, temporal expressions and quantities are detected using regular expressions.</p>
        <p>The extracted information can be exported in standard formats to facilitate interoperability with external HIS/EHR-based systems. The HL7 FHIR [16] standard is used for this purpose. The system allows the generation of electronic prescriptions using the recognized entities, such as drugs, diseases, time expressions and persons. Once the data has been recognized and structured, it is exported into the format specified by HL7 FHIR using partially instantiated templates. In addition, diagnostic tests can be generated from the data dictated by the professional by adding SNOMED CT [17] codes so that the concept being treated can be easily identified.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Integration with external services</title>
        <p>These newly developed modules are currently being integrated into VÓCALI’s existing systems. This will allow us to quantify the real improvement of this process in terms of quality and process performance. Furthermore, taking into account that the models and resources developed during this work are more computationally intensive, two levels of integration have been implemented: one with lightweight components to provide real-time response and feedback, and another, more accurate system with the rest of the functionality. In addition, to achieve the desired performance, deployment techniques based on Docker containers and dynamic per-server deployment were applied, which can dynamically scale and respond to different levels of demand.</p>
        <p>Moreover, a web application can be used to access the system and help physicians process all the information derived from the medical report. Figure 2 shows a screenshot of this interface. On the left, we can see the transcription of the medical report. On the right, we can see the identified entities, which are grouped below into sections such as diagnostic test or drug prescription.</p>
      </sec>
      <fn-group>
        <fn id="fn1"><label>1</label><p>https://mtsamples.com/</p></fn>
        <fn id="fn2"><label>2</label><p>https://github.com/coqui-ai/TTS</p></fn>
        <fn id="fn3"><label>3</label><p>https://huggingface.co/projecte-aina/tts-ca-coqui-vits-multispeaker</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>3. Future work</title>
      <p>The completion of this project marks a significant advancement in the development of VÓCALI’s ASR systems, particularly in the healthcare domain. However, there are some enhancements that will be developed in the near future.</p>
      <p>An important direction for future work is the refinement and optimization of ASR models, with a particular focus on healthcare contexts. The acquisition and integration of more diverse and specialized healthcare audio data into the training sets would be crucial. This step aims to overcome current limitations due to privacy concerns surrounding healthcare data, thus enabling ASR models to better capture the nuances and subtleties of medical terminology and scenarios.</p>
      <p>Improving post-processing systems for punctuation and capitalization restoration is critical, with a particular focus on overcoming the challenges posed by medical acronyms and abbreviations. Advanced algorithms tailored to accurately restore punctuation and capitalization while effectively deciphering medical terms are needed, possibly using context-aware models and domain-specific dictionaries or ontologies. These improvements are aimed at increasing the readability and comprehension of transcribed clinical reports, benefiting healthcare professionals with clearer and more understandable transcriptions, and ultimately contributing to the usability and reliability of ASR systems in medical contexts.</p>
      <p>Finally, future work could explore the feasibility and effectiveness of adapting the ASR systems to additional languages, thereby extending the reach and applicability of the technology developed.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work was funded by the Spanish Government, Ministerio para la Transformación Digital y la Función Pública, through the "Recovery, Transformation and Resilience Plan" and also funded by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>