
Knowledge-Based Systems

Sanivert: Transformers-based End2End System for Speech Recognition in Spanish, Catalan and Portuguese in Healthcare

Pedro José Vivancos-Vicente

Juan Salvador Castejón-Garrido

Ronghao Pan

Camilo Caparrós-Laiz

José Antonio García-Díaz

Rafael Valencia-García

0 Departamento de Informática y Sistemas, Universidad de Murcia, Campus de Espinardo, 30100 Murcia, Spain

1 VÓCALI SISTEMAS INTELIGENTES S.L., Parque Científico de Murcia, Carretera de Madrid km 388, Complejo de Espinardo, 30100 Murcia, Spain

2024


This project focuses on improving VÓCALI's Automatic Speech Recognition (ASR) systems, especially in the healthcare domain. Despite the impressive performance of state-of-the-art speech recognition models such as Whisper, the lack of healthcare-specific training data hinders their effectiveness in this crucial domain. In addition, most ASR systems produce unpunctuated output, which reduces readability, especially in scenarios where interpretation is ambiguous. To address these challenges, the Sanivert project aims to adapt speech recognition models using end-to-end deep learning approaches and to develop post-processing systems for punctuation and capitalization restoration, which are crucial for improving the quality of speech recognition output. The project also incorporates information extraction techniques, such as named entity recognition and relation extraction, to facilitate the extraction of clinical knowledge from dictated reports. With a focus on Spanish, Catalan and Portuguese, the project aligns with VÓCALI's existing solutions while strategically expanding into new markets. Ultimately, VÓCALI aims to create more adaptable and accurate ASR systems tailored to different languages and clinical specialties, ensuring improved performance in healthcare and other domains.

ASR, punctuation restoration, knowledge extraction, natural language processing, medical domain

1. Introduction and main objective

This project is funded by the Spanish Government and the Digital Transformation Ministry and by the European Union - NextGenerationEU under the “Plan de Recuperación, Transformación y Resiliencia”, under the 2021 call of research projects in Artificial Intelligence and other digital technologies and their integration in value chains.

Currently, VÓCALI specializes in the development of Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) systems in various domains such as home automation, robotics, telephony, or public administration, with healthcare being the company’s largest market niche.

With the rapid development of Transformers and pretrained models, ASR models such as Conformer [1], Wav2Vec 2.0 [2], HuBERT [3] and Whisper [4], among others, have demonstrated very good performance thanks to their transfer learning capabilities: knowledge acquired during the pre-training phase is transferred to specific tasks with relatively little additional training data.

Among the aforementioned models, Whisper [4] stands out because it has demonstrated human-level performance. It is pretrained for ASR on 680k hours of weakly supervised data and is able to generalize more robustly, making about 50% fewer errors in zero-shot settings, i.e. when transcribing speech without having been specifically trained on the target task or dataset. However, due to data protection and patient privacy issues, the training set does not include audio from the healthcare sector, and the results for this sector are not good enough for commercial purposes. Another limitation of most ASR systems is that they produce unpunctuated output, which significantly reduces readability and overall comprehension, especially in scenarios where interpretation is ambiguous. Therefore, the restoration of punctuation and capitalization is one of the most important post-processing tasks in ASR systems [5].

On the other hand, in the medical field, physicians face the challenge of managing large volumes of narrative documents containing critical patient information. Information extraction techniques such as named entity recognition and relation extraction play a key role in structuring and extracting valuable insights from clinical reports [6].

The main objective of Sanivert is to adapt and improve VÓCALI’s ASR models by incorporating Deep Learning technologies based on end-to-end (E2E) approaches, so that the systems and models generated do not depend on the isolated development of acoustic, phonological and linguistic models, as is the case with current systems. This greatly facilitates the adaptation of current solutions to new languages and new clinical specialties. In addition, punctuation and capitalization restoration systems have been developed, as well as systems for extracting clinical knowledge from dictated reports.

This main objective is divided into 5 objectives.

• (OB1) Generation of linguistic resources by medical specialty and language.
• (OB2) Generation of E2E models for ASR based on Transformers.
• (OB3) Development of a punctuation restoration system.
• (OB4) Development of a clinical knowledge extraction system.
• (OB5) Integration of previous modules.

In this project, the technologies developed will be applied in three languages: Spanish, Catalan and Portuguese. Spanish was selected because it is the language on which most of VÓCALI’s solutions are based and is therefore the best language for validation. The selection of Catalan and Portuguese is a strategic business decision to reach Catalonia, Portugal and Brazil.

2. System architecture

The architecture of the proposed system is shown in Figure 1. A brief description of each of its components is given below.

2.1. E2E ASR

As mentioned above, current ASR systems perform very well in Spanish. In fact, Whisper [4] has a very low word error rate (WER) in Spanish, reaching 4.2 in its large-v2 version for the Multilingual LibriSpeech (MLS) dataset [7]. However, the WER increases significantly in domains such as medicine, which contain very specialized terms that the system does not transcribe correctly. Table 1 shows an example of some of these words in Spanish, Catalan and Portuguese.

To address these issues, fine-tuning was performed for each language using a proprietary dataset of audio recordings paired with their transcriptions, in order to adapt the Whisper versions to the medical domain. Other existing datasets of clinical reports were also used to expand the training set, such as CodiEsp [8], E3C [9], and MTSamples¹. However, these datasets are not multimodal because they lack audio. For this reason, we used Text-to-Speech models such as Coqui-TTS² to transform the texts into audio with different real human voices. Coqui TTS includes tools for training new models and adapting existing models to any language. For Catalan, we used a model trained from scratch with three datasets: Festcat, OpenSLR69 and Common Voice v12. This model, based on Coqui TTS, is called projecte-aina/tts-ca-coqui-vits-multispeaker³.

Currently, Whisper has 5 different configurations of variable size: tiny, base, small, medium and large. These models are being evaluated by fine-tuning them to determine which are better suited to the medical domain. In addition, other Transformers-based ASR systems have been tested with good results, with a view to moving these models into production.

2.2. Punctuation and capitalization restoration

Punctuation restoration is a post-processing task in Natural Language Generation (NLG) that consists of adding capitalization and punctuation symbols to a text. Most ASR systems do not provide this information, which hinders language understanding and limits the performance of automatic text classification models.

Nowadays, few models incorporate punctuation and capitalization restoration. Whisper [4], for example, incorporates both. To do this, Whisper processes the audio in the encoder, generating hidden states that the decoder uses to generate the tokens, so that the output is restricted to the spoken text with punctuation added.

Our punctuation and capitalization restoration models are based on Transformers models used for sequence labeling in Spanish, Catalan and Portuguese [5, 10]. Besides restoring punctuation, they are also able to restore capitalization, which improves the detection of medical entities. Punctuation and capitalization restoration are both handled by the same model.
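To make the sequence labeling formulation concrete, the following minimal sketch shows one way punctuated text could be turned into per-token labels, and how predicted labels could be applied back to plain ASR output. It is an illustration only: the label inventory (`COMMA`, `PERIOD`, `QUESTION`, `UPPER`, `TITLE`, `LOWER`), the use of two labels per token, and the helper names `make_training_pairs` and `restore` are assumptions, not the tag set or code used in Sanivert, and no Transformer model is trained here.

```python
import re

# Hypothetical label inventory; the paper does not list the actual tag set.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_LABELS.items()}


def case_label(word):
    """Classify the casing of a word as UPPER, TITLE or LOWER."""
    if len(word) > 1 and word.isupper():
        return "UPPER"
    if word[:1].isupper():
        return "TITLE"
    return "LOWER"


def make_training_pairs(text):
    """Turn punctuated text into (plain token, punctuation label, case label) triples."""
    triples = []
    for raw in text.split():
        # Split an optional trailing punctuation mark from the word itself.
        match = re.match(r"^(.*?)([,.?]?)$", raw)
        word, punct = match.group(1), match.group(2)
        if not word:
            continue  # skip tokens that consist of punctuation only
        triples.append((word.lower(), PUNCT_LABELS.get(punct, "O"), case_label(word)))
    return triples


def restore(triples):
    """Rebuild readable text from plain tokens and their (predicted) labels."""
    restored = []
    for token, punct, case in triples:
        if case == "UPPER":
            token = token.upper()
        elif case == "TITLE":
            token = token[:1].upper() + token[1:]
        restored.append(token + LABEL_TO_PUNCT.get(punct, ""))
    return " ".join(restored)
```

In this framing, a sequence labeling model would receive only the lowercased tokens (the kind of output an unpunctuated ASR system produces) and predict the two labels for each token; `restore` then applies the predictions. Predicting casing alongside punctuation is what allows acronyms such as "EPOC" to be recovered from "epoc".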

To train the punctuation and capitalization restoration models, we rely on the OpusParaCrawl dataset [11], which contains data in Spanish, Catalan and Portuguese. This dataset contains a total of 17.2 million sentences and 774.58 million words written in Spanish. For Portuguese, we use the “pt-en” partition, which consists of 84.9 million sentences and 2.74 billion words written in Portuguese.

Two extensions have been made to this dataset. First, we include an augmentation for Catalan based on the work described in [12], in which the authors consider replacing words with an unknown token, random insertion and random deletion. We use this method, but with a back-translation technique that consists of first translating the text into a specific language and then translating it back into its original language. When translating across different languages in this way, translation models often replace some words with synonyms or generate new phrases with a similar meaning. Secondly, since this corpus does not contain texts related to the health sector, we have expanded it with texts from VÓCALI’s electronic clinical reports, with the aim of adapting the models to the health sector and improving their results there.

To train these models, we first cleaned the dataset and tokenized it. We then made a custom split of the dataset into training, evaluation and test sets and, finally, trained the Transformers models using the sequence tagging approach.

2.3. Knowledge extraction

The knowledge extraction system is capable of extracting medical concepts from natural language text and annotating the text with them, based on the use of ontologies, entity recognition and semantic annotation technologies. This module is based on previous work of the research group [13, 14, 15]. First, medical terms are recognized using ontologies, for which we have compiled several lists of medical concepts and related them using an ontology that provides the structure from which new knowledge can be inferred. In addition, medical entities and other data are recognized in the reports. Finally, temporal expressions and quantities are detected using regular expressions.

The extracted information can be exported in standard formats to facilitate interoperability with external HIS/EHR-based systems. The HL7 FHIR [16] standard is used for this purpose. The system allows electronic prescriptions to be generated from the recognized entities, such as drugs, diseases, time expressions and persons. Once the data has been recognized and structured, it must be exported into the format specified by HL7 FHIR using partially instantiated templates. In addition, diagnostic tests can be generated from the data dictated by the professional by adding SNOMED CT [17] codes, so that the concept being treated can be easily identified.

2.4. Integration with external services

These newly developed modules are currently being integrated into VÓCALI’s existing systems. This will allow us to quantify the real improvement in terms of quality and process performance. Furthermore, since the models and resources developed during this work are more computationally intensive, two levels of integration have been implemented: one with lightweight components that provides real-time response and feedback, and another, more accurate system with the rest of the functionality. In addition, to achieve the desired performance, deployment techniques based on Docker containers and dynamic per-server deployment were applied, which can scale dynamically and respond to different levels of demand.

In addition, a web application can be used to access the system and help physicians process all the information derived from the medical report. Figure 2 shows a screenshot of this interface. On the left, we can see the transcription of the medical report. On the right, we can see the identified entities, which are grouped below into sections such as diagnostic test or drug prescription.

3. Future work

The completion of this project marks a significant advancement in the development of VÓCALI’s ASR systems, particularly in the healthcare domain. However, there are some enhancements that will be developed in the near future.

An important direction for future work is the refinement and optimization of ASR models, with a particular focus on healthcare contexts. The acquisition and integration of more diverse and specialized healthcare audio data into the training sets would be crucial. This step aims to overcome current limitations due to privacy concerns surrounding healthcare data, thus enabling ASR models to better capture the nuances and subtleties of medical terminology and scenarios.

Improving post-processing systems for punctuation and capitalization restoration is critical, with a particular focus on overcoming the challenges posed by medical acronyms and abbreviations. Advanced algorithms tailored to accurately restore punctuation and capitalization while effectively deciphering medical terms are needed, possibly using context-aware models and domain-specific dictionaries or ontologies. These improvements are aimed at increasing the readability and comprehension of transcribed clinical reports, benefiting healthcare professionals with clearer and more understandable transcriptions, and ultimately contributing to the usability and reliability of ASR systems in medical contexts.

Finally, future work could explore the feasibility and effectiveness of adapting the ASR systems to additional languages, thereby extending the reach and applicability of the technology developed.

¹ https://mtsamples.com/
² https://github.com/coqui-ai/TTS
³ https://huggingface.co/projecte-aina/tts-ca-coqui-vits-multispeaker

Acknowledgments

This work was funded by the Spanish Government, Ministerio para la Transformación Digital y la Función Pública, through the "Recovery, Transformation and Resilience Plan", and also by the European Union NextGenerationEU/PRTR through the research project 2021/C005/0015007.