<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Estimating the Quality of Translated Medical Texts using Back Translation &amp; Resource Description Framework</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Vinay</forename><surname>Neekhra</surname></persName>
							<email>vinay.neekhra@research.iiit.ac.in</email>
						</author>
						<author>
							<persName><forename type="first">Dipti</forename><surname>Misra</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Language Technology Research Center</orgName>
								<orgName type="department" key="dep2">Kohli Center on Intelligent Systems</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Institute of Information Technology</orgName>
								<orgName type="institution">IIIT-Hyderabad</orgName>
								<address>
									<settlement>Hyderabad</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Estimating the Quality of Translated Medical Texts using Back Translation &amp; Resource Description Framework</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7A8B50442E92755E311F1F173C474DF4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Translation Quality Estimation</term>
					<term>Resource Description Framework (RDF)</term>
					<term>Back Translation</term>
					<term>GATE</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents. This paper proposes a novel metric, GATE 1 , for translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both semantic and syntactical information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to get the semantic similarity score to assess the quality of the translation. Unlike traditional metrics like BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactical information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a better Pearson correlation (0.357) as compared to BLEU (0.200), thereby showing ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources like back-translation and RDF could be useful.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>A drug trial in the medical domain incorporates a mandatory consent form called a Medical Consent Form (MCF), which informs the patient about the experiment and its potential side effects. There is a legal requirement for the MCF to be in the patient's mother tongue and for it to be easy to understand. A human translator translates the original MCF into the patient's mother tongue. As MCFs are sensitive documents, evaluating the quality of translated texts is crucial to ensure faithfulness to the original texts (see Section 1.1 for an example).</p><p>One way to evaluate the quality of the translated texts is using back-translation (see Section 3.1), wherein the translated text is translated back into the original language. The original and back-translated texts are then compared to estimate the quality of the translation. Back-translation is a prominent way to assess the quality of translated texts in domains, such as medical documents, where accuracy and precision are paramount <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>.</p><p>Experienced professionals are responsible for carrying out all three procedures (see Figure <ref type="figure" target="#fig_0">1</ref>), namely: initial translation from the source language to the target language, followed by translation from the target language back to the source language, and ultimately, comparison between the original text and the back-translated texts. Our efforts are focused on reducing the efforts of human evaluators comparing the original and back-translated texts by automating the task of evaluating the quality of translated texts.</p><p>While human evaluation has traditionally served as a benchmark for assessing translation quality, it is often expensive, time-consuming, and subjective. 
As an alternative, automatic evaluation metrics such as BLEU <ref type="bibr" target="#b2">[3]</ref> and METEOR <ref type="bibr" target="#b3">[4]</ref> have been developed to provide a more efficient and objective means of evaluation, with BLEU being the most commonly used (see Section 2 for related work). Translation quality estimation (QE) is the field of research concerned with evaluating the quality of translated texts when gold-standard translations (called reference texts) are unavailable.</p><p>In this paper, we propose a novel translation evaluation metric, GATE (Graphical Assessment for Translation quality Estimation), which leverages back-translation (see Section 3.1) and the Resource Description Framework (RDF) (see Section 3.2). GATE encodes both semantic and syntactic information of the original and back-translated sentences into RDF graphs, allowing for a reference-less, semantically-aware assessment of translation quality.</p><p>For sensitive documents in the medical field, such as medical consent forms and qualitative research, back-translation is a common practice to ensure the faithfulness of translations <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>. GATE capitalizes on this by integrating back-translation into its evaluation framework, providing a comprehensive and reliable assessment of translation quality. To estimate the quality of translated texts, we encode the meaning of the original and back-translated sentences into RDF graphs and then compare these graphs to produce a similarity score (see Figure <ref type="figure">4</ref>). GATE shows a higher correlation (0.357) with human judgment than BLEU (0.200) (see Section 4 for the experiment details). In Section 1.1, we discuss the significance of translation evaluation, highlighting the context and motivation behind our research efforts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Significance of Translation Evaluation</head><p>Consider the following sentence from a medical consent form for a vaccine trial, translated to the patient's mother tongue (Tamil language) where the original consent form is in English.</p><p>• Source text: There are no side effects mentioned previously.</p><p>To comply with legal requirements, the consent form was translated into Tamil by hospital authorities, resulting in two translated versions. For evaluating the translation quality, the translated MCF was back-translated to English, yielding the following results:</p><p>• Back Translation 1: No side effects which were mentioned previously • Back Translation 2: It has already been mentioned that it does not have any side-effects As seen above, the first back-translated sentence is semantically similar to the source text and preserves the original intent. The second back translated text, on the other hand, conveys that -as previously mentioned, there are no side-effects-, whereas the original intent was that no side-effects have been observed yet, thus raising ethical and legal concerns. Thus, it is crucial, that translated texts are evaluated for their faithfulness to the original text, especially in the medical domain. In the next subsection, we highlight the contributions of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Contributions</head><p>1. This paper presents a novel approach, GATE, for translation quality estimation task by utilizing back-translation and leveraging knowledge graphs (namely, Resource Description Framework) for encoding the meaning of original and back-translated texts to come up with a translation quality estimation score. 2. GATE incorporates both syntactic and semantic information, leading to improved evaluation scores. Our approach is applicable to both machine-translated and human-translated texts. Our experiments demonstrate a better correlation with human judgment compared to BLEU, with a Pearson correlation of 0.357 compared to the most commonly used metric, BLEU's 0.200. 3. Our approach eliminates the need for reference texts by comparing the source text directly with its back-translated counterpart. This makes our approach reference-less and thus valuable for scenarios where reference texts are not available for translation evaluation (such as medical consent forms). 4. While our results do not surpass the current state-of-the-art, our metric, GATE, offers distinct advantages such as requiring no training, being computationally lightweight, being available for low-resource languages, and operating without the need for extensive training data, unlike neural network-based methods like COMET <ref type="bibr" target="#b4">[5]</ref>.</p><p>The paper is structured as follows: Section 2 reviews related work in the area of translation evaluation, discussing the limitations of existing metrics. Section 3 builds the foundation of our work, providing an overview of back-translation along with its significance, introduces Knowledge Graphs in general, and describes Resource Description Framework (RDF) and FRED RDF graphs. Section 4 details the experiment design and methodology leading to the creation of GATE. 
The results of our experiments are presented in Section 5, along with a discussion of the insights gained from our research and the current limitations of our metric. Finally, Sections 6 and 7 conclude the paper and outline directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Existing metrics for translation evaluation, such as BLEU <ref type="bibr" target="#b2">[3]</ref>, METEOR <ref type="bibr" target="#b3">[4]</ref>, NIST <ref type="bibr" target="#b5">[6]</ref>, and TER <ref type="bibr" target="#b6">[7]</ref>, have been widely utilized in the field, with BLEU being the most commonly used among them. BLEU compares the translated sentence with a reference sentence. It operates on word group matching using an n-gram model and remains popular due to its simplicity. In contrast, METEOR was developed as a successor to BLEU to account for synonyms and other variations in language. Usually, the quality of translation is evaluated at the sentence level, but word and document level QE are also possible <ref type="bibr" target="#b7">[8]</ref>.</p><p>However, these metrics have inherent limitations. Many traditional metrics are categorized as n-gram matching metrics, relying on handcrafted features to estimate translation quality by counting the number and fraction of n-grams shared between a candidate translation hypothesis and one or more human references. This restricts their ability to capture nuanced meaning, particularly in complex and domain-specific texts. They often rely on surface-level similarity measures and may necessitate reference translations, typically provided by humans as a standard of perfection.</p><p>More recent approaches have explored the use of word embeddings as an alternative to n-gram matching for capturing word semantic similarity. Metrics like BLEU2VEC <ref type="bibr" target="#b8">[9]</ref>, BERT SCORE <ref type="bibr" target="#b9">[10]</ref>, and COMET <ref type="bibr" target="#b4">[5]</ref> create alignments between reference and hypothesis segments in an embedding space to compute a score reflecting semantic similarity. COMET, a notable metric in this domain, has demonstrated remarkable results for translation evaluation. 
However, the availability of word embeddings for low-resource languages remains a significant challenge for training these models.</p><p>Moreover, these metrics still fall short of capturing the full range of nuances reflected in human judgments. Challenges with existing metrics include their reliance on reference texts for comparison, the requirement of semantic exactness at the word level, susceptibility to differences in lexical structure (such as word order), a tendency to measure semantic relatedness rather than semantic similarity, and large data requirements for training models, which makes them ill-suited for low-resource languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head><p>This section lays out the foundation required for our experiment design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Back Translation:</head><p>Back translation is a process where a translated text is translated back into the original language (source language) by a different translator <ref type="bibr" target="#b10">[11]</ref>. In Figure <ref type="figure" target="#fig_0">1</ref>, translation and back-translation processes between English and French are illustrated, as depicted by <ref type="bibr" target="#b11">[12]</ref>. Back translation is recommended in the domains where the content subjected to translation is too sensitive and needs to be double-checked. The back-translation method is widely used in medical research and clinical trials, as it is required by Ethics Committees and regulatory authorities in several countries <ref type="bibr" target="#b0">[1]</ref>. This allows us to compare the back-translated text with the original text to evaluate the quality of the translation.</p><p>The rationale behind using back-translation is that for sensitive documents in the medical domain, back-translation is a recommended practice to cross-verify that the translation adheres to the intended meaning. Usually, back-translation is mandatory in case of quality assessment of medical consent forms, so this is not an overhead in this particular scenario and is generally recommended for medical, legal, market research, and government agencies working in public health, safety, and legal matters. We are utilizing this for translation evaluation. We aim to address the specific needs of these domains to ensure the faithfulness of the translated texts. Our efforts are to use already available back-translation texts for the translation evaluation tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Resource Description Framework</head><p>The Resource Description Framework (RDF) is a W3C standard for data representation on the Web. RDF provides a foundation for encoding information in a structured way for the Semantic Web <ref type="bibr" target="#b12">[13]</ref>. It is particularly useful for representing knowledge about entities and the relationships between them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Components of RDF</head><p>RDF consists of triplets, which are fundamental units of information. These triplets, also known as RDF triples, form the building blocks for representing knowledge within an RDF graph. Each RDF triple is composed of three elements:</p><p>1. Subject: The resource (entity) being described. (e.g., "The patient") 2. Predicate: The property or characteristic of the subject, denoted by directed arrows. (e.g., "has diagnosisof") 3. Object: The value associated with the predicate for the subject. (e.g., "pneumonia") In Figure <ref type="figure" target="#fig_1">2</ref>, the RDF triple depicts a statement about a patient having a diagnosis of pneumonia. In the context of our research, we leverage RDF to capture the semantics of the sentences, enabling a more nuanced evaluation of translation quality compared to traditional metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">FRED RDF Graphs</head><p>Our research is based on RDF graphs provided by FRED (Framework for RDF-based Extraction and Disambiguation) <ref type="bibr" target="#b13">[14]</ref> to capture semantic nuances in translated texts. At its core, FRED leverages the Resource Description Framework (RDF) to construct semantic graphs that capture the relationships and entities present in the text. FRED bridges the gap between unstructured text and structured knowledge representation, employing Semantic Web technologies to extract and disambiguate information from textual data. Figure <ref type="figure" target="#fig_2">3</ref> shows the RDF graph for the sentence "An experimental drug is one which has not been approved by FDA. ". </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment Design</head><p>We conduct a comparative experiment to evaluate the efficacy of our proposed RDF-based evaluation metric, GATE, in comparison to the baseline metric BLEU and its correlation with human judgment. To obtain baseline BLEU scores, we are using iBLEU <ref type="bibr" target="#b14">[15]</ref>. The evaluation procedure, outlined in Algorithm 1, explains the comparison of RDF graphs generated through the FRED API, which can be accessed at http://wit.istc.cnr.it/stlab-tools/fred/demo/.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset</head><p>Our experiments were done on the selected medical consent forms and the sentences from Semantic Textual Similarity (STS) Benchmark Dataset <ref type="bibr" target="#b15">[16]</ref> to evaluate the effectiveness of GATE in capturing semantic similarity compared to BLEU. The medical consent forms dataset has around 250 original sentence, their corresponding translations, and the back-translated texts, all provided by human translators. Due to the selected availability of medical data, we augmented our analysis with the STS benchmark dataset. In total, our experiments were conducted on 500 sentence pairs, with 250 pairs sourced from medical consent forms provided by a medical institute.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Graph comparison and GATE Score</head><p>We are comparing the source sentence with the back-translated text by constructing RDF graphs for both. The distance between graphs is measured as the Jaccard similarity coefficient <ref type="bibr" target="#b16">[17]</ref> between the entities in the graphs. This way, the distance between the source and the back-translated sentence graph is normalized between 0 and 1, where 1 denotes an exact match, and 0 denotes no similarity. Algorithm 1 outlines the steps in the evaluation process. Specifically, for source sentence s 𝑘 , and the back-translated text b 𝑘 , the GATE Score is calculated as follows: b 𝑘 ← back-translation of t 𝑘 (either already available or obtained using Google Translate)</p><formula xml:id="formula_0">𝐺 𝑘 = 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ∩ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 ) 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ∪ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )</formula><p>𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) ← RDF graph nodes of s 𝑘 using FRED G 𝑘 ← 𝑐𝑜𝑚𝑚𝑜𝑛 𝑢𝑛𝑖𝑠𝑜𝑛 11:</p><p>12: end for</p><p>In the next section, we present the findings of our experiments along with a discussion of the insights gained from our research efforts while also addressing the current limitations of our metric.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results &amp; Discussion</head><p>Our experiment implemented the proposed GATE metric alongside the baseline metric, BLEU. We calculated the Pearson correlation between the BLEU score and GATE score against human judgment on the experiment dataset. Our results in Table <ref type="table" target="#tab_0">1</ref>, show that GATE achieves a significantly higher correlation with human judgment in translation evaluation tasks compared to the widely used metric, BLEU. Specifically, GATE exhibits a ~70% improvement in correlation on the experiment data, with a Pearson correlation coefficient of 0.357 compared to BLEU's 0.200. The higher correlation underscores the effectiveness of leveraging RDF graphs in capturing semantic information, thereby improvement in correlation with human judgments. Table <ref type="table" target="#tab_1">2</ref> shows examples with corresponding human evaluation scores, GATE scores, and BLEU scores. These examples serve to highlight GATE's capability to better reflect human perception of semantic similarity, as evidenced by its closer alignment with human judgments compared to BLEU scores. In summary, our findings indicate that integrating RDF graphs with already existing back-translated texts holds promise for reference-free translation evaluation. This metric can potentially assist human evaluators who evaluate the translation of sensitive documents using back-translated texts.</p><p>Using RDF for translation evaluation could be helpful as they 'encode' real-world semantics akin to how embeddings work in neural network frameworks (such as COMET), contrasting with metrics that are based on lexical level information for translation evaluation (such as BLEU). This work has the potential to pave the way for utilizing knowledge graphs in the field of translation evaluation alongside existing resources, such as word embeddings and LLM-based frameworks. 
Our experiments reinforce this belief, demonstrating that using knowledge graphs to encode meaning is helpful and yields better results than the baseline metric.</p><p>Given that FRED's RDF extraction is currently available only for English and our metric compares graphs of the original and back-translated texts, our metric is presently applicable only where English is the source language. However, the target language can be any language for which back-translation is available.</p><p>While our results do not surpass state-of-the-art performance, they serve as a proof of concept, showcasing the effectiveness of leveraging RDF graphs for translation evaluation tasks. As FRED accommodates long sentences as well, our future work will involve more extensive real-world translated medical data and longer sentences to demonstrate the method's effectiveness comprehensively. These results underscore the advantages of GATE over traditional metrics like BLEU and motivate further validation of GATE's applicability on real-world data, particularly in domains like medicine, along with continued exploration to improve the metric.</p></div>
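The system-level comparison in Table 1 is a Pearson correlation between a metric's scores and human judgments, which can be computed directly. A minimal sketch follows; the three example score pairs are the human and GATE values from rows 1-3 of Table 2, used purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [1.00, 0.75, 0.47]  # human judgments (Table 2, rows 1-3)
gate  = [0.65, 0.45, 0.53]  # GATE scores for the same pairs
r = pearson(human, gate)
```

Running this over all 500 sentence pairs for each metric yields the system-wide coefficients reported in Table 1.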
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we introduce GATE, a novel metric based on the Resource Description Framework (RDF) designed for assessing the quality of translated medical texts for which back-translation is available. To showcase the effectiveness of our metric, we conducted experiments using selected medical data and the STS benchmark dataset, comparing the results against the baseline metric, BLEU, and human judgment scores. Notably, GATE exhibits a stronger correlation with human judgment than BLEU, achieving a higher Pearson correlation coefficient (0.357 compared to BLEU's 0.200), representing approximately a ~70% improvement over BLEU, the most commonly used metric.</p><p>By leveraging back-translation and using RDF graphs to encode both semantic and syntactical information, GATE provides a reference-less and semantically aware assessment of translation quality. In comparison with the more advanced Large Language Model (LLM)-based metrics such as COMET, our metric is computationally much lighter. It works for any target language, including low-resource languages, and does not require any data training. Our research shows that, in the field of translation evaluation, existing resources like back-translation and Resource Description Framework could be helpful in real-world scenarios such as the medical domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Future Directions</head><p>As part of future work, we would like to explore:</p><p>1. Conducting further experiments to validate the efficacy of GATE on real-world translated medical data. 2. Since Translation and Summarization can both be viewed as natural language generation from a textual context, we aim to explore knowledge graphs such as RDF in the area of evaluating summarization or similar natural language generation tasks. Investigate the utilization of knowledge graphs for tasks beyond translation evaluation, such as summarization. 3. For calculating GATE score, experimenting with different formulas incorporating variations in weights of entities, incoming edges, and outgoing edges.</p><p>4. Addressing the challenge of language dependency in GATE by incorporating multilingual knowledge graphs since FRED works only with English texts. A primary avenue for future work, will be looking into the inclusion of other knowledge graphs available in other languages, making GATE language independent. 5. Development of a software similar to iBLEU for integrating FRED API to facilitate automatic scoring of source and back-translated texts, enhanced visualization, and accessibility of the RDF metric. 6. <ref type="bibr" target="#b17">[18]</ref> shows that back-translation could be useful for improving the translation quality for lowresource languages. Our future work is to combine neural networks with back-translation and knowledge graphs in the area of translation evaluation for low-resource languages. 
Our future work aims to combine these technologies with knowledge graphs (such as knowledge graph embeddings) to improve our metric, making it suitable for evaluating translated sensitive texts, particularly in low-resource settings.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of Back Translation (best viewed in color)</figDesc><graphic coords="4,72.00,65.61,451.28,147.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: RDF Triple for the sentence "The patient has diagnosis of pneumonia"</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: FRED RDF graph for "An experimental drug is one which has not been approved by FDA. " taken from a medical consent form.</figDesc><graphic coords="5,72.01,248.82,451.26,170.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :Algorithm 1 :</head><label>41</label><figDesc>Figure 4: Graph Comparison for measuring semantic similarity. Common nodes are highlighted in multiple colors. In these two graphs there are 8 common nodes, and total unique nodes are 15. (best viewed in color)</figDesc><graphic coords="6,72.01,225.87,451.26,425.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>5 :: 7 : 8 :</head><label>578</label><figDesc>𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 ) ← RDF graph nodes of b 𝑘 using FRED 6common ← {𝑥 | 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) and 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )} unison ← {𝑥 | 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(s 𝑘 ) or 𝑥 ∈ 𝑒𝑛𝑡𝑖𝑡𝑖𝑒𝑠(b 𝑘 )}</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>System-wide Pearson correlation of BLEU and GATE with human judgments on MCFs Data and STS Benchmark</figDesc><table><row><cell>Dataset</cell><cell></cell></row><row><cell cols="2">Metric Pearson Correlation</cell></row><row><cell>BLEU</cell><cell>0.200</cell></row><row><cell>GATE</cell><cell>0.357</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>GATE vs. BLEU score against human evaluation. Selected examples from the experiment run on STS dataset. Higher correlation with human judgment are marked in bold.</figDesc><table><row><cell cols="2">Serial Hypothesis</cell><cell>Reference</cell><cell cols="3">Human GATE BLEU</cell></row><row><cell>1.</cell><cell cols="2">A man is erasing a chalk board The man is erasing the chalk board</cell><cell>1.00</cell><cell>0.65</cell><cell>0.60</cell></row><row><cell>2.</cell><cell>Three men are playing guitars</cell><cell>Three men on stage are playing guitars</cell><cell>0.75</cell><cell>0.45</cell><cell>0.60</cell></row><row><cell>3.</cell><cell>A woman is carrying a boy</cell><cell>A woman is carrying her baby</cell><cell>0.47</cell><cell>0.53</cell><cell>0.63</cell></row><row><cell>4.</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Our sincere gratitude to the late Prof. Ravi Kothari, on whose suggestions this research work was started. We thank anonymous reviewers for their time and valuable suggestions for improving the paper. We also express our gratitude to Supriya Ranjan, Bhavesh Neekhra, Mamatha Alugubelly, and others for their invaluable feedback and support.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Back translation for quality control of informed consent forms</title>
		<author>
			<persName><forename type="first">D</forename><surname>Grunwald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goldfarb</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Translation and back-translation in qualitative nursing research: methodological review</title>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Boore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Clinical Nursing</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="234" to="239" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">BLEU: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="page" from="311" to="318" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</title>
				<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">COMET-22: Unbabel-IST 2022 submission for the metrics shared task</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G C</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Farinha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Glushkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Coheur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.wmt-1.52" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics</title>
				<meeting>the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates; Hybrid</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="578" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic evaluation of machine translation quality using n-gram co-occurrence statistics</title>
		<author>
			<persName><forename type="first">G</forename><surname>Doddington</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the second international conference on Human Language Technology Research</title>
				<meeting>the second international conference on Human Language Technology Research</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="138" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A study of translation edit rate with targeted human annotation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Snover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dorr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Micciulla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Makhoul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</title>
				<meeting>the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="223" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Quality estimation for machine translation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Paetzold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hirst</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
			<biblScope unit="volume">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">bleu2vec: the painfully familiar metric on continuous vector space steroids</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tättar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fishel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W17-4771</idno>
		<ptr target="https://aclanthology.org/W17-4771" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">O</forename><surname>Bojar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Buck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Chatterjee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Federmann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Haddow</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Huck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Yepes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Kreutzer</surname></persName>
		</editor>
		<meeting>the Second Conference on Machine Translation, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="619" to="622" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<title level="m">BERTScore: Evaluating text generation with BERT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">What is back translation?</title>
		<ptr target="https://gtelocalize.com/what-is-back-translation/" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">H</forename><surname>Trinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Luong</surname></persName>
		</author>
		<ptr target="https://github.com/vietai/dab" />
		<title level="m">A tutorial on data augmentation by backtranslation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://www.w3.org/TR/PR-rdf-syntax/" />
		<title level="m">Resource Description Framework (RDF) Model and Syntax Specification (revised)</title>
				<imprint>
			<publisher>World Wide Web Consortium</publisher>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Semantic Web Machine Reading with FRED</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Presutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Recupero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Nuzzolese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Draicchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mongiovì</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="873" to="893" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">iBLEU: Interactively debugging and scoring statistical machine translation systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Madnani</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSC.2011.36</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Fifth International Conference on Semantic Computing</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="213" to="214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lopez-Gazpio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/s17-2001</idno>
		<ptr target="http://dx.doi.org/10.18653/v1/S17-2001" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics</title>
				<meeting>the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://en.wikipedia.org/wiki/Jaccard_index" />
		<author>
			<persName><surname>Wikipedia contributors</surname></persName>
		</author>
		<title level="m">Jaccard index</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Jaccard similarity</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Enhanced back-translation for low resource neural machine translation using self-training</title>
		<author>
			<persName><forename type="first">I</forename><surname>Abdulmumin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Galadanci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Isa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information and Communication Technology and Applications</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Misra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Muhammad-Bello</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="355" to="371" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
