<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Estimating the Quality of Translated Medical Texts using Back Translation &amp; Resource Description Framework</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Vinay</forename><surname>Neekhra</surname></persName>
							<email>vinay.neekhra@research.iiit.ac.in</email>
						</author>
						<author>
							<persName><forename type="first">Dipti</forename><surname>Misra</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Language Technology Research Center</orgName>
								<orgName type="department" key="dep2">Kohli Center on Intelligent Systems</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Institute of Information Technology</orgName>
								<orgName type="institution">IIIT-Hyderabad</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Estimating the Quality of Translated Medical Texts using Back Translation &amp; Resource Description Framework</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7A8B50442E92755E311F1F173C474DF4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Translation Quality Estimation</term>
					<term>Resource Description Framework (RDF)</term>
					<term>Back Translation</term>
					<term>GATE</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents. This paper proposes a novel metric, GATE 1 , for translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both semantic and syntactical information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to get the semantic similarity score to assess the quality of the translation. Unlike traditional metrics like BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactical information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a better Pearson correlation (0.357) as compared to BLEU (0.200), thereby showing ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources like back-translation and RDF could be useful.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>A drug trial in the medical domain incorporates a mandatory consent form called a Medical Consent Form (MCF), which informs the patient about the experiment and its potential side effects. There is a legal requirement for the MCF to be in the patient's mother tongue and for it to be easy to understand. A human translator translates the original MCF into the patient's mother tongue. As MCFs are sensitive documents, evaluating the quality of translated texts is crucial to ensure faithfulness to the original texts (see Section 1.1 for an example).</p><p>One way to evaluate the quality of the translated texts is using back-translation (see Section 3.1), wherein the translated text is translated back into the original language. The original and back-translated texts are then compared to estimate the quality of the translation. Back-translation is a prominent way to assess the quality of translated texts in domains, such as medical documents, where accuracy and precision are paramount <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>.</p><p>Experienced professionals are responsible for carrying out all three procedures (see Figure <ref type="figure" target="#fig_0">1</ref>), namely: initial translation from the source language to the target language, followed by translation from the target language back to the source language, and ultimately, comparison between the original text and the back-translated texts. Our efforts are focused on reducing the efforts of human evaluators comparing the original and back-translated texts by automating the task of evaluating the quality of translated texts.</p><p>While human evaluation has traditionally served as a benchmark for assessing translation quality, it is often expensive, time-consuming, and subjective. 
As an alternative, automatic evaluation metrics such as BLEU <ref type="bibr" target="#b2">[3]</ref> and METEOR <ref type="bibr" target="#b3">[4]</ref> have been developed to provide a more efficient and objective means of evaluation, with BLEU being the most commonly used (see Section 2 for related work). Translation quality estimation (QE) is the field of research concerned with evaluating the quality of translated texts when gold-standard translations (called reference texts) are unavailable.</p><p>In this paper, we propose a novel translation evaluation metric, GATE (Graphical Assessment for Translation quality Estimation), which leverages back-translation (see Section 3.1) and the Resource Description Framework (RDF) (see Section 3.2). GATE encodes both semantic and syntactic information of the original and back-translated sentences into RDF graphs, allowing for a reference-less, semantically-aware assessment of translation quality.</p><p>For sensitive documents in the medical field, such as medical consent forms and qualitative research, back-translation is a common practice to ensure the faithfulness of translations <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>. GATE capitalizes on this by integrating back-translation into its evaluation framework, providing a comprehensive and reliable assessment of translation quality. To estimate the quality of translated texts, we encode the meaning of the original and back-translated sentences into RDF graphs and then compare these graphs to produce a similarity score (see Figure <ref type="figure">4</ref>). GATE shows a higher correlation (0.357) with human judgment than BLEU (0.200) (see Section 4 for the experiment details). In Section 1.1, we discuss the significance of translation evaluation, highlighting the context and motivation behind our research efforts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Significance of Translation Evaluation</head><p>Consider the following sentence from a medical consent form for a vaccine trial, translated to the patient's mother tongue (Tamil language) where the original consent form is in English.</p><p>• Source text: There are no side effects mentioned previously.</p><p>To comply with legal requirements, the consent form was translated into Tamil by hospital authorities, resulting in two translated versions. For evaluating the translation quality, the translated MCF was back-translated to English, yielding the following results:</p><p>• Back Translation 1: No side effects which were mentioned previously • Back Translation 2: It has already been mentioned that it does not have any side-effects As seen above, the first back-translated sentence is semantically similar to the source text and preserves the original intent. The second back translated text, on the other hand, conveys that -as previously mentioned, there are no side-effects-, whereas the original intent was that no side-effects have been observed yet, thus raising ethical and legal concerns. Thus, it is crucial, that translated texts are evaluated for their faithfulness to the original text, especially in the medical domain. In the next subsection, we highlight the contributions of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Contributions</head><p>1. This paper presents a novel approach, GATE, for translation quality estimation task by utilizing back-translation and leveraging knowledge graphs (namely, Resource Description Framework) for encoding the meaning of original and back-translated texts to come up with a translation quality estimation score. 2. GATE incorporates both syntactic and semantic information, leading to improved evaluation scores. Our approach is applicable to both machine-translated and human-translated texts. Our experiments demonstrate a better correlation with human judgment compared to BLEU, with a Pearson correlation of 0.357 compared to the most commonly used metric, BLEU's 0.200. 3. Our approach eliminates the need for reference texts by comparing the source text directly with its back-translated counterpart. This makes our approach reference-less and thus valuable for scenarios where reference texts are not available for translation evaluation (such as medical consent forms). 4. While our results do not surpass the current state-of-the-art, our metric, GATE, offers distinct advantages such as requiring no training, being computationally lightweight, being available for low-resource languages, and operating without the need for extensive training data, unlike neural network-based methods like COMET <ref type="bibr" target="#b4">[5]</ref>.</p><p>The paper is structured as follows: Section 2 reviews related work in the area of translation evaluation, discussing the limitations of existing metrics. Section 3 builds the foundation of our work, providing an overview of back-translation along with its significance, introduces Knowledge Graphs in general, and describes Resource Description Framework (RDF) and FRED RDF graphs. Section 4 details the experiment design and methodology leading to the creation of GATE. 
The results of our experiments are presented in Section 5, along with a discussion of the insights gained from our research and the current limitations of our metric. Finally, Sections 6 and 7 conclude the paper and outline directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Existing metrics for translation evaluation, such as BLEU <ref type="bibr" target="#b2">[3]</ref>, METEOR <ref type="bibr" target="#b3">[4]</ref>, NIST <ref type="bibr" target="#b5">[6]</ref>, and TER <ref type="bibr" target="#b6">[7]</ref>, have been widely utilized in the field, with BLEU being the most commonly used among them. BLEU compares the translated sentence with a reference sentence. It operates on word group matching using an n-gram model and remains popular due to its simplicity. In contrast, METEOR was developed as a successor to BLEU to account for synonyms and other variations in language. Usually, the quality of translation is evaluated at the sentence level, but word and document level QE are also possible <ref type="bibr" target="#b7">[8]</ref>.</p><p>However, these metrics have inherent limitations. Many traditional metrics are categorized as n-gram matching metrics, relying on handcrafted features to estimate translation quality by counting the number and fraction of n-grams shared between a candidate translation hypothesis and one or more human references. This restricts their ability to capture nuanced meaning, particularly in complex and domain-specific texts. They often rely on surface-level similarity measures and may necessitate reference translations, typically provided by humans as a standard of perfection.</p><p>More recent approaches have explored the use of word embeddings as an alternative to n-gram matching for capturing word semantic similarity. Metrics like BLEU2VEC <ref type="bibr" target="#b8">[9]</ref>, BERT SCORE <ref type="bibr" target="#b9">[10]</ref>, and COMET <ref type="bibr" target="#b4">[5]</ref> create alignments between reference and hypothesis segments in an embedding space to compute a score reflecting semantic similarity. COMET, a notable metric in this domain, has demonstrated remarkable results for translation evaluation. 
However, the availability of word embeddings for low-resource languages remains a significant challenge for training these models.</p><p>Moreover, these metrics still fall short of capturing the full range of nuances reflected in human judgments. Challenges with existing metrics include their reliance on reference texts for comparison, the requirement of semantic exactness at the word level, susceptibility to differences in lexical structure (such as word order), a tendency to measure semantic relatedness rather than semantic similarity, and large data requirements for training models, which makes them ill-suited for low-resource languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head><p>This section lays out the foundation required for our experiment design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Back Translation:</head><p>Back translation is a process where a translated text is translated back into the original language (source language) by a different translator <ref type="bibr" target="#b10">[11]</ref>. In Figure <ref type="figure" target="#fig_0">1</ref>, translation and back-translation processes between English and French are illustrated, as depicted by <ref type="bibr" target="#b11">[12]</ref>. Back translation is recommended in the domains where the content subjected to translation is too sensitive and needs to be double-checked. The back-translation method is widely used in medical research and clinical trials, as it is required by Ethics Committees and regulatory authorities in several countries <ref type="bibr" target="#b0">[1]</ref>. This allows us to compare the back-translated text with the original text to evaluate the quality of the translation.</p><p>The rationale behind using back-translation is that for sensitive documents in the medical domain, back-translation is a recommended practice to cross-verify that the translation adheres to the intended meaning. Usually, back-translation is mandatory in case of quality assessment of medical consent forms, so this is not an overhead in this particular scenario and is generally recommended for medical, legal, market research, and government agencies working in public health, safety, and legal matters. We are utilizing this for translation evaluation. We aim to address the specific needs of these domains to ensure the faithfulness of the translated texts. Our efforts are to use already available back-translation texts for the translation evaluation tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Resource Description Framework</head><p>The Resource Description Framework (RDF) is a W3C standard for data representation on the Web. RDF provides a foundation for encoding information in a structured way for the Semantic Web <ref type="bibr" target="#b12">[13]</ref>. It is particularly useful for representing knowledge about entities and the relationships between them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Components of RDF</head><p>RDF consists of triplets, which are fundamental units of information. These triplets, also known as RDF triples, form the building blocks for representing knowledge within an RDF graph. Each RDF triple is composed of three elements:</p><p>1. Subject: The resource (entity) being described. (e.g., "The patient") 2. Predicate: The property or characteristic of the subject, denoted by directed arrows. (e.g., "has diagnosisof") 3. Object: The value associated with the predicate for the subject. (e.g., "pneumonia") In Figure <ref type="figure" target="#fig_1">2</ref>, the RDF triple depicts a statement about a patient having a diagnosis of pneumonia. In the context of our research, we leverage RDF to capture the semantics of the sentences, enabling a more nuanced evaluation of translation quality compared to traditional metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">FRED RDF Graphs</head><p>Our research is based on RDF graphs provided by FRED (Framework for RDF-based Extraction and Disambiguation) <ref type="bibr" target="#b13">[14]</ref> to capture semantic nuances in translated texts. At its core, FRED leverages the Resource Description Framework (RDF) to construct semantic graphs that capture the relationships and entities present in the text. FRED bridges the gap between unstructured text and structured knowledge representation, employing Semantic Web technologies to extract and disambiguate information from textual data. Figure <ref type="figure" target="#fig_2">3</ref> shows the RDF graph for the sentence "An experimental drug is one which has not been approved by FDA. ". </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment Design</head><p>We conduct a comparative experiment to evaluate the efficacy of our proposed RDF-based evaluation metric, GATE, in comparison to the baseline metric BLEU and its correlation with human judgment. To obtain baseline BLEU scores, we are using iBLEU <ref type="bibr" target="#b14">[15]</ref>. The evaluation procedure, outlined in Algorithm 1, explains the comparison of RDF graphs generated through the FRED API, which can be accessed at http://wit.istc.cnr.it/stlab-tools/fred/demo/.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset</head><p>Our experiments were done on the selected medical consent forms and the sentences from Semantic Textual Similarity (STS) Benchmark Dataset <ref type="bibr" target="#b15">[16]</ref> to evaluate the effectiveness of GATE in capturing semantic similarity compared to BLEU. The medical consent forms dataset has around 250 original sentence, their corresponding translations, and the back-translated texts, all provided by human translators. Due to the selected availability of medical data, we augmented our analysis with the STS benchmark dataset. In total, our experiments were conducted on 500 sentence pairs, with 250 pairs sourced from medical consent forms provided by a medical institute.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Graph comparison and GATE Score</head><p>We are comparing the source sentence with the back-translated text by constructing RDF graphs for both. The distance between graphs is measured as the Jaccard similarity coefficient <ref type="bibr" target="#b16">[17]</ref> between the entities in the graphs. This way, the distance between the source and the back-translated sentence graph is normalized between 0 and 1, where 1 denotes an exact match, and 0 denotes no similarity. Algorithm 1 outlines the steps in the evaluation process. Specifically, for source sentence s 𝑘 , and the back-translated text b 𝑘 , the GATE Score is calculated as follows: b 𝑘 ← back-translation of t 𝑘 (either already available or obtained using Google Translate)</p><formula xml:id="formula_0">𝐺 𝑘 = 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ∩ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 ) 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ∪ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )</formula><p>𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ← RDF graph nodes of s 𝑘 using FRED G 𝑘 ← 𝑐𝑜𝑚𝑚𝑜𝑛 𝑢𝑛𝑖𝑠𝑜𝑛 11:</p><p>12: end for</p><p>In the next section, we present the findings of our experiments along with a discussion of the insights gained from our research efforts while also addressing the current limitations of our metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results &amp; Discussion</head><p>Our experiment implemented the proposed GATE metric alongside the baseline metric, BLEU. We calculated the Pearson correlation between the BLEU score and GATE score against human judgment on the experiment dataset. Our results in Table <ref type="table" target="#tab_0">1</ref>, show that GATE achieves a significantly higher correlation with human judgment in translation evaluation tasks compared to the widely used metric, BLEU. Specifically, GATE exhibits a ~70% improvement in correlation on the experiment data, with a Pearson correlation coefficient of 0.357 compared to BLEU's 0.200. The higher correlation underscores the effectiveness of leveraging RDF graphs in capturing semantic information, thereby improvement in correlation with human judgments. Table <ref type="table" target="#tab_1">2</ref> shows examples with corresponding human evaluation scores, GATE scores, and BLEU scores. These examples serve to highlight GATE's capability to better reflect human perception of semantic similarity, as evidenced by its closer alignment with human judgments compared to BLEU scores. In summary, our findings indicate that integrating RDF graphs with already existing back-translated texts holds promise for reference-free translation evaluation. This metric can potentially assist human evaluators who evaluate the translation of sensitive documents using back-translated texts.</p><p>Using RDF for translation evaluation could be helpful as they 'encode' real-world semantics akin to how embeddings work in neural network frameworks (such as COMET), contrasting with metrics that are based on lexical level information for translation evaluation (such as BLEU). This work has the potential to pave the way for utilizing knowledge graphs in the field of translation evaluation alongside existing resources, such as word embeddings and LLM-based frameworks. 
Our experiments reinforce this belief, demonstrating that using knowledge graphs to encode meaning is helpful and yields better results than the baseline metric.</p><p>Given that FRED's RDF extraction is currently available only for English and our metric compares graphs of the original and back-translated texts, our metric is presently applicable only where English is the source language. However, the target language can be any language for which back-translation is available.</p><p>While our results do not surpass state-of-the-art performance, they serve as a proof of concept, showcasing the effectiveness of leveraging RDF graphs for translation evaluation tasks. As FRED accommodates long sentences as well, our future work will involve more extensive real-world translated medical data and longer sentences to demonstrate the method's effectiveness comprehensively. These results underscore the advantages of GATE over traditional metrics like BLEU and motivate further validation of GATE's applicability on real-world data, particularly in domains like medicine, along with continued exploration to improve the metric.</p></div>
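The system-level comparison in Table 1 is a Pearson correlation between a metric's scores and human judgments, which can be computed directly. A minimal sketch follows; the three example score pairs are the human and GATE values from rows 1-3 of Table 2, used purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [1.00, 0.75, 0.47]  # human judgments (Table 2, rows 1-3)
gate  = [0.65, 0.45, 0.53]  # GATE scores for the same pairs
r = pearson(human, gate)
```

Running this over all 500 sentence pairs for each metric yields the system-wide coefficients reported in Table 1.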
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we introduce GATE, a novel metric based on the Resource Description Framework (RDF) designed for assessing the quality of translated medical texts for which back-translation is available. To showcase the effectiveness of our metric, we conducted experiments using selected medical data and the STS benchmark dataset, comparing the results against the baseline metric, BLEU, and human judgment scores. Notably, GATE exhibits a stronger correlation with human judgment than BLEU, achieving a higher Pearson correlation coefficient (0.357 compared to BLEU's 0.200), representing approximately a ~70% improvement over BLEU, the most commonly used metric.</p><p>By leveraging back-translation and using RDF graphs to encode both semantic and syntactical information, GATE provides a reference-less and semantically aware assessment of translation quality. In comparison with the more advanced Large Language Model (LLM)-based metrics such as COMET, our metric is computationally much lighter. It works for any target language, including low-resource languages, and does not require any data training. Our research shows that, in the field of translation evaluation, existing resources like back-translation and Resource Description Framework could be helpful in real-world scenarios such as the medical domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Future Directions</head><p>As part of future work, we would like to explore:</p><p>1. Conducting further experiments to validate the efficacy of GATE on real-world translated medical data. 2. Since Translation and Summarization can both be viewed as natural language generation from a textual context, we aim to explore knowledge graphs such as RDF in the area of evaluating summarization or similar natural language generation tasks. Investigate the utilization of knowledge graphs for tasks beyond translation evaluation, such as summarization. 3. For calculating GATE score, experimenting with different formulas incorporating variations in weights of entities, incoming edges, and outgoing edges.</p><p>4. Addressing the challenge of language dependency in GATE by incorporating multilingual knowledge graphs since FRED works only with English texts. A primary avenue for future work, will be looking into the inclusion of other knowledge graphs available in other languages, making GATE language independent. 5. Development of a software similar to iBLEU for integrating FRED API to facilitate automatic scoring of source and back-translated texts, enhanced visualization, and accessibility of the RDF metric. 6. <ref type="bibr" target="#b17">[18]</ref> shows that back-translation could be useful for improving the translation quality for lowresource languages. Our future work is to combine neural networks with back-translation and knowledge graphs in the area of translation evaluation for low-resource languages. 
Our future work aims to combine these technologies with knowledge graphs (such as knowledge graph embeddings) to improve our metric, making it suitable for evaluating translated sensitive texts, particularly in low-resource settings.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of Back Translation (best viewed in color)</figDesc><graphic coords="4,72.00,65.61,451.28,147.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: RDF Triple for the sentence "The patient has diagnosis of pneumonia"</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: FRED RDF graph for "An experimental drug is one which has not been approved by FDA. " taken from a medical consent form.</figDesc><graphic coords="5,72.01,248.82,451.26,170.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :Algorithm 1 :</head><label>41</label><figDesc>Figure 4: Graph Comparison for measuring semantic similarity. Common nodes are highlighted in multiple colors. In these two graphs there are 8 common nodes, and total unique nodes are 15. (best viewed in color)</figDesc><graphic coords="6,72.01,225.87,451.26,425.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>5 :: 7 : 8 :</head><label>578</label><figDesc>𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 ) ← RDF graph nodes of b 𝑘 using FRED 6common ← {𝑥 | 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) and 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )} unison ← {𝑥 | 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) or 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )}</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>System-wide Pearson correlation of BLEU and GATE with human judgments on MCFs Data and STS Benchmark</figDesc><table><row><cell>Dataset</cell><cell></cell></row><row><cell cols="2">Metric Pearson Correlation</cell></row><row><cell>BLEU</cell><cell>0.200</cell></row><row><cell>GATE</cell><cell>0.357</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>GATE vs. BLEU score against human evaluation. Selected examples from the experiment run on STS dataset. Higher correlation with human judgment are marked in bold.</figDesc><table><row><cell cols="2">Serial Hypothesis</cell><cell>Reference</cell><cell cols="3">Human GATE BLEU</cell></row><row><cell>1.</cell><cell cols="2">A man is erasing a chalk board The man is erasing the chalk board</cell><cell>1.00</cell><cell>0.65</cell><cell>0.60</cell></row><row><cell>2.</cell><cell>Three men are playing guitars</cell><cell>Three men on stage are playing guitars</cell><cell>0.75</cell><cell>0.45</cell><cell>0.60</cell></row><row><cell>3.</cell><cell>A woman is carrying a boy</cell><cell>A woman is carrying her baby</cell><cell>0.47</cell><cell>0.53</cell><cell>0.63</cell></row><row><cell>4.</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Our sincere gratitude to the late Prof. Ravi Kothari, on whose suggestions this research work was started. We thank anonymous reviewers for their time and valuable suggestions for improving the paper. We also express our gratitude to Supriya Ranjan, Bhavesh Neekhra, Mamatha Alugubelly, and others for their invaluable feedback and support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Back translation for quality control of informed consent forms</title>
		<author>
			<persName><forename type="first">D</forename><surname>Grunwald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goldfarb</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Translation and back-translation in qualitative nursing research: methodological review</title>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Boore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Clinical Nursing</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="234" to="239" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">BLEU: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="page" from="311" to="318" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</title>
				<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">COMET-22: Unbabel-IST 2022 submission for the metrics shared task</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G C</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Farinha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Glushkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Coheur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.wmt-1.52" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics</title>
				<meeting>the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates; Hybrid</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="578" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic evaluation of machine translation quality using n-gram co-occurrence statistics</title>
		<author>
			<persName><forename type="first">G</forename><surname>Doddington</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the second international conference on Human Language Technology Research</title>
				<meeting>the second international conference on Human Language Technology Research</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="138" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A study of translation edit rate with targeted human annotation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Snover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dorr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Micciulla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Makhoul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</title>
				<meeting>the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="223" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Quality estimation for machine translation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Paetzold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hirst</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
			<biblScope unit="volume">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">bleu2vec: the painfully familiar metric on continuous vector space steroids</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tättar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fishel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W17-4771</idno>
		<ptr target="https://aclanthology.org/W17-4771" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">O</forename><surname>Bojar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Buck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Chatterjee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Federmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Haddow</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Huck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Yepes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Kreutzer</surname></persName>
		</editor>
		<meeting>the Second Conference on Machine Translation, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="619" to="622" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<title level="m">BERTScore: Evaluating text generation with BERT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">What is back translation?</title>
		<ptr target="https://gtelocalize.com/what-is-back-translation/" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Trinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luong</surname></persName>
		</author>
		<ptr target="https://github.com/vietai/dab" />
		<title level="m">A tutorial on data augmentation by backtranslation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://www.w3.org/TR/PR-rdf-syntax/" />
		<title level="m">Resource Description Framework (RDF) Model and Syntax Specification (revised)</title>
				<imprint>
			<publisher>World Wide Web Consortium</publisher>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Semantic Web Machine Reading with FRED</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Recupero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Nuzzolese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Draicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mongiovì</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="873" to="893" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">iBLEU: Interactively debugging and scoring statistical machine translation systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Madnani</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSC.2011.36</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Fifth International Conference on Semantic Computing</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="213" to="214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lopez-Gazpio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/s17-2001</idno>
		<ptr target="http://dx.doi.org/10.18653/v1/S17-2001" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics</title>
				<meeting>the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://en.wikipedia.org/wiki/Jaccard_index" />
		<author>
			<persName><surname>Wikipedia contributors</surname></persName>
		</author>
		<title level="m">Jaccard index</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Jaccard similarity</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Enhanced back-translation for low resource neural machine translation using self-training</title>
		<author>
			<persName><forename type="first">I</forename><surname>Abdulmumin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Galadanci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Isa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information and Communication Technology and Applications</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Misra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Muhammad-Bello</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="355" to="371" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
