<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>"Alignment is All You Need": Analyzing Cross-Lingual Text Similarity for Domain-Specific Applications</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Huawei Research Centre</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cross-lingual text similarity provides an important measure to assess the contextual and semantic similarity between documents across different languages. Extraction of similar or aligned multi-lingual texts would enable efficient approaches for information retrieval and natural language processing applications. However, the diversity of linguistic constructs coupled with domain specificity and low resources poses a significant challenge. In this paper, we present a study analyzing the performance of different existing approaches, and show that Word Mover's Distance on aligned language embeddings provides a reliable and cost-effective cross-lingual text similarity measure to tackle evolving domain information, even when compared to advanced machine learning models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Motivation. The explosion of openly available data on the World Wide Web
provides a wealth of mono-lingual text in the form of documents, news
articles, and blogs, to name a few. However, with the growth of application domains
catering to diverse geographic regions and target user groups, understanding
cross-lingual information has become an interesting area of research. As such,
estimating the semantic similarity between documents, transcending the language
barrier, plays an important role in information retrieval for applications like news
aggregation [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], document summarization, and question answering [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Further,
extraction of parallel or comparable corpora assumes a crucial role in a variety of
natural language processing (NLP) tasks like machine translation [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], word-level
lexicon similarities [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and in learning large multi-lingual language models like
BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and XLM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Generalization of the above can lead to enhancements in
diverse downstream language-agnostic NLP applications like document clustering,
information fusion, translation [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], or parallel corpus generation.
Challenges. Consider an organization expanding its offerings to different global
markets, naturally encompassing a multi-lingual target customer base. Scaling
the operations of its automated services, like Chatbots [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and retrieval systems [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
would require "multi-lingual capabilities" to cater to the new or emerging markets.
In such scenarios, semantic relationships among such diverse cross-lingual data
need to be efficiently computed for ease of information aggregation, and for
business analytics based on geographical trends. Further, for event-centric
information, like election reports or natural calamities, the rapid evolution
of data over time, across different sources, as well as in different languages, poses
a challenge for obtaining a global understanding of the event evolution, possible
interlinkings, and information propagation. Current approaches tend to rely on
large pre-trained language models for computing textual semantic similarity, and
also for "zero-shot learning" capabilities [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] to automatically transfer knowledge
between languages, enabling decent multi-lingual document understanding and
retrieval performance.
      </p>
      <p>
        However, in both the above scenarios, existing techniques suffer from the
presence of (i) domain-specificity, (ii) limited training data, and (iii) evolving
information. Specifically, pre-trained language models might fail to identify
contextual relationships in domain-specific applications (in terms of the operational
domain of the enterprise), and would be difficult to train on evolving
information (having limited training data). Further, language models like T5 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and
GPT-3 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are extremely expensive and resource-intensive to train, maintain, and
use in practice, potentially limiting their application to large organizations only.
Other learning techniques might also be ineffective due to the presence of limited
training resources (in terms of annotated data), and in general fail to sufficiently
generalize to morphologically rich and low-resourced languages.
      </p>
      <p>
        Contributions. In this paper, we study the problem of multi-lingual text
alignment and explore the efficacy of unsupervised strategies for accurately capturing
semantic similarity in cross-lingual domain-specific contents. To this end, we show
that the Word Mover's Distance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] measure applied on aligned vector space
vocabulary embeddings across languages is quite effective in terms of accuracy,
comparable to state-of-the-art large language model architectures. Experiments
on various domains and languages showcase the above strategy to be reliable
in handling not only domain-specific data, but also morphologically rich and
low-resourced languages, providing a generalizable, cost-effective, multi-lingual
text similarity measure for information retrieval, aggregation, and fusion.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Understanding Cross-Lingual Text Similarity</title>
      <p>
        Initial approaches for cross-lingual text similarity relied on differences across
documents based on descriptor terms from a multi-lingual thesaurus [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. However,
to reduce the dependency on expensive manually created parallel dictionaries,
translation-based text similarity approaches using lexical and syntactic features
along with overlap of synonymous words from resources like WordNet [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] were
proposed [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. With automated translation techniques and growth of parallel data
availability, machine learning models were used to detect semantically similar
documents [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], which were further extended to zero-shot environments via
transfer learning from pre-trained language models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We next briefly discuss
possible state-of-the-art strategies to measure text similarity.
(A) Aligned Semantic Distance. The success of distributional word
representations like Word2Vec [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and FastText [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in capturing word meanings has been
established in a wide range of NLP tasks. Document embedding techniques like
doc2vec [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with cosine distance have been traditionally used to effectively
capture semantic similarity between texts. The Word Mover's Distance (WMD) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
provides a far more effective unsupervised distance metric by formulation as an
optimal transport problem based on the Earth Mover's or Wasserstein distance.
Mathematically, given two distributions X and Y (sets of word embeddings),
WMD computes the minimum effort of transforming one distribution into the
other by solving $\min_{\Gamma} \sum_{i,j} \Gamma_{ij} C_{ij}$, where $\Gamma_{ij}$ is the amount of transformation
required and $C_{ij}$ is the associated distance between points $i$ and $j$ across the two
distributions (or documents). However, in multi-lingual settings, the distributed
word vector representations for the different languages are created in potentially
different embedding spaces, as the monolingual embeddings are learnt in a relative
fashion, and thus might have different orientations and degrees of freedom [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As
such, direct application of the original WMD formulation in this setting would
not be appropriate.
      </p>
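      <p>
        As a minimal illustrative sketch (our addition, not part of the original formulation), the above WMD optimization can be solved with the Python Optimal Transport library used later in our experiments; the toy embeddings and nBOW weights below are assumed inputs:
      </p>
      <preformat>
# Minimal WMD sketch using the Python Optimal Transport (POT) library.
# The word embeddings and normalized bag-of-words (nBOW) weights are toy inputs.
import numpy as np
import ot  # pip install pot

# Toy word embeddings for two documents (rows = words).
X = np.array([[0.1, 0.9], [0.8, 0.2]])               # document 1: 2 words
Y = np.array([[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]])   # document 2: 3 words

# Normalized word frequencies (nBOW) for each document.
a = np.array([0.5, 0.5])
b = np.array([0.4, 0.4, 0.2])

# Cost matrix C_ij: distance between word i of doc 1 and word j of doc 2.
C = ot.dist(X, Y, metric='euclidean')

# Solve min_Gamma sum_ij Gamma_ij * C_ij under the marginal constraints.
wmd = ot.emd2(a, b, C)
print(f"WMD between the two documents: {wmd:.4f}")
      </preformat>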
      <p>
        Cross-lingual word embedding alignment entails mapping the vocabularies
of the different languages onto a single vector space to capture syntactic and
semantic similarity of words across language boundaries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Leveraging the
similarity of geometric properties between different monolingual word embedding
spaces [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], several supervised and unsupervised frameworks employing adversarial
learning coupled with refinement strategies were proposed, such as MUSE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
VecMap [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and RCSLS [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In general, to align the different independently
learnt monolingual word embeddings (for the different languages) onto a common
vector representation space, the above approaches tend to learn a transformation
$T$ between the two language embedding spaces $X$ and $Y$. Mathematically, this involves
the optimization $\min_T \|X - T(Y)\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. This,
when constrained to orthonormal matrices, results in the closed-form orthogonal
Procrustes problem [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], used in the literature as a refinement strategy.
      </p>
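      <p>
        As an illustrative sketch (a minimal reconstruction under our assumptions, not the full MUSE/VecMap pipeline), the orthogonal Procrustes solution can be obtained in closed form via SVD, given a small seed dictionary of paired word vectors:
      </p>
      <preformat>
# Orthogonal Procrustes sketch: rotate source embeddings Y into the target
# space X. Rows of X and Y are embeddings of seed-dictionary word pairs.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimizing ||X - Y W||_F."""
    # Closed-form solution (Schonemann, 1966): W = U V^T, with U S V^T = SVD(Y^T X).
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy seed dictionary: 4 word pairs in 3-dimensional embedding spaces.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                 # target-language vectors
W_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = X @ W_true.T                            # source vectors: a rotated copy of X

W = procrustes_align(X, Y)
print(np.allclose(Y @ W, X, atol=1e-6))     # True: the rotation is recovered
      </preformat>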
      <p>
        Once the vector spaces are aligned onto a common representation domain,
the WMD formulation can now be directly applied on this new shared space, as
presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We refer to this approach as Word Mover's Distance - Aligned
(WMD-A), while the use of the original non-aligned embeddings for WMD is
denoted as WMD-NA. Alternatively, a naïve approach would be to obtain a text
representation via averaged aligned word embeddings, and use cosine distance
between the contents as a measure of similarity. However, such weighting
strategies tend to yield lower accuracy [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and are hence not considered as
baselines in our analysis.
(B) Multi-Domain Wasserstein Distance. Since the Wasserstein distance is not
suited to comparing distances across different geometric domain spaces
(as mentioned above), the Gromov-Wasserstein (GW) distance [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] was proposed.
To generalize between different domains, the GW distance takes into account
the metric spaces and compares the distances between points in one domain
space to the distances between points in the other space, i.e., it finds the optimal
coupling via $\min_{\Gamma} \sum_{i,j,k,l} C_{ij,kl}\,\Gamma_{ik}\,\Gamma_{jl}$, where $\Gamma$ specifies the transportation between pairs of
points from the two spaces ($i,j$ and $k,l$) and $C_{ij,kl}$ is the difference in the distances
between the point pairs within the individual domains (a minimal sketch is given
after this paragraph). In this scenario, WMD coupled with the GW distance
(henceforth referred to as WMD-GW) provides a viable option for computing
multi-lingual document semantic similarity based on (non-aligned) word
embeddings across different languages.
(C) Pre-trained Language Models. Contextual language models (CLM) like
BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] take into account the context of a word occurrence to provide "dynamic"
word embeddings, where the same word in different contexts is represented by
different vector embeddings, capturing possibly different meanings based on
usage. A natural extension to multi-lingual settings was explored, and CLMs like
M-BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and XLM-R [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] were developed using the transformer architecture
with multi-lingual parallel corpora and a shared vocabulary. The presence of
open-source data enabled the creation of huge pre-trained language models from large
repositories of Wikipedia pages. These language models were shown to be adept
at several NLP tasks like question answering [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], text summarization, document
similarity, text generation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and zero-shot transfer learning, with "near
human-level" language "understanding" in certain scenarios [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Thus, text embeddings
from multi-lingual CLMs like M-BERT, coupled with the cosine similarity measure,
are commonly used for computing cross-lingual document similarity.
      </p>
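      <p>
        A minimal sketch of the GW computation referenced above (our illustration with the Python Optimal Transport library; the toy point clouds are assumed inputs) follows:
      </p>
      <preformat>
# Gromov-Wasserstein sketch with POT: compare two point clouds living in
# different (non-aligned) embedding spaces via intra-space distance matrices.
import numpy as np
import ot  # pip install pot

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300))   # e.g., word vectors of document 1 (space 1)
Y = rng.normal(size=(4, 128))   # e.g., word vectors of document 2 (space 2)

# Intra-domain distance matrices; only these enter the GW objective, so the
# two embedding spaces never need to be mutually aligned.
C1 = ot.dist(X, X, metric='euclidean')
C2 = ot.dist(Y, Y, metric='euclidean')

# Uniform word weights for each document.
p = np.full(5, 1 / 5)
q = np.full(4, 1 / 4)

# min_Gamma sum_{i,j,k,l} L(C1[i,j], C2[k,l]) * Gamma[i,k] * Gamma[j,l]
gw = ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun='square_loss')
print(f"Gromov-Wasserstein discrepancy: {gw:.4f}")
      </preformat>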
      <p>
        Larger language models were shown to be better for downstream NLP tasks,
leading to enormous models like T5 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and GPT-3 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with 11B and 175B
parameters, respectively. Although their performance in text similarity, comprehension,
translation, and zero-shot transfer learning is astounding, these models require
high-end compute resources for training (re-training to capture relationships
in domain-specific and evolving data is not practical), and remain brittle on
low-resourced languages.
(D) Sentence Embedding. Recent approaches like dual-encoder based
Universal Sentence Encoder [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Siamese network based SBERT [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for generating
contextualized sentence embeddings involve a layer of deep learning architecture
atop the pre-trained contextualized language models, providing multi-lingual
sentence embeddings. Specifically, the sentence transformer architecture of
multi-lingual SBERT uses a teacher-student knowledge distillation framework coupled
with a fine-tuned language model to generate effective multi-lingual text vector
representations. Such models have been shown to outperform sentence
embeddings obtained directly from the CLMs for text similarity tasks. Multi-lingual
SBERT with cosine similarity is considered a strong state-of-the-art
technique for capturing semantic similarity across multi-lingual short texts.
      </p>
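      <p>
        As an illustrative sketch (the checkpoint name is one of the publicly available multi-lingual SBERT models, assumed here purely for demonstration), cross-lingual similarity with sentence embeddings reduces to encoding followed by cosine similarity:
      </p>
      <preformat>
# Multi-lingual SBERT sketch: encode sentences, then compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

# A publicly available multi-lingual checkpoint (assumed for illustration).
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = [
    'The medicine should be stored below 25 degrees.',  # English
    'Das Arzneimittel ist unter 25 Grad zu lagern.',    # German translation
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Higher cosine similarity indicates closer cross-lingual semantics.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cross-lingual cosine similarity: {score.item():.3f}")
      </preformat>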
    </sec>
    <sec id="sec-3">
      <title>Experimental Analysis</title>
      <p>We analyze the performance of the above techniques for computing multi-lingual
text semantic similarity in the presence of domain-specificity and linguistic diversity.
Baselines. We analyze the following existing approaches in cross-lingual settings:
(i) WMD-NA – Word Mover's Distance on the independently learnt monolingual
embeddings (without any alignment) obtained from FastText (fasttext.cc);
(ii) WMD-GW – Word Mover's Distance coupled with Gromov-Wasserstein
distance, implemented using the Python Optimal Transport library (pythonot.github.io);
(iii) WMD-A – Word Mover's Distance on aligned FastText word embeddings
(alignment for Xhosa to English obtained from VecMap, github.com/artetxem/vecmap);
(iv) M-BERT – token embeddings from the pre-trained multi-lingual BERT language
model (using github.com/hanxiao/bert-as-service) are used to compute cosine
similarity between texts; and,
(v) SBERT – text embeddings obtained from the multi-lingual sentence transformer
based teacher-student architecture (github.com/UKPLab/sentence-transformers) are
used with cosine similarity for content similarity.</p>
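      <p>
        As a sketch of the WMD-A setup (our assumption of the pipeline; the aligned-vector file path and tokenized inputs are placeholders), gensim exposes WMD directly over a loaded set of word vectors:
      </p>
      <preformat>
# WMD-A pipeline sketch with gensim: load aligned FastText vectors for two
# languages into one KeyedVectors object (aligned vectors share one space,
# so both vocabularies can be merged into a single word2vec-format file),
# then query WMD between tokenized texts.
from gensim.models import KeyedVectors

# Placeholder path: merged, aligned English + German vectors.
vectors = KeyedVectors.load_word2vec_format('aligned.en-de.vec')

doc_en = ['the', 'medicine', 'should', 'be', 'stored', 'below', '25', 'degrees']
doc_de = ['das', 'arzneimittel', 'ist', 'unter', '25', 'grad', 'zu', 'lagern']

# Lower WMD means higher cross-lingual semantic similarity.
distance = vectors.wmdistance(doc_en, doc_de)
print(f"WMD-A distance: {distance:.4f}")
      </preformat>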
      <p>Dataset. We use documents across different languages and diverse domains from
OPUS, the open parallel corpus, obtained from opus.nlpl.eu. Specifically, we use
sentence translation pairs from the following six domain-specific collections:
(i) EMEA (Medical) – a parallel corpus from the European Medicines Agency;
(ii) JRC-Acquis (Judicial) – a collection of legislative texts of the European Union;
(iii) Bible-uedin (Religious) – created from translations of the Holy Bible;
(iv) MultiUN (Legislative) – a collection of documents from the United Nations;
(v) TedTalks (Generic) – a corpus of transcribed and translated TED talks; and,
(vi) XhosaNavy (Maritime) – maritime texts from the South African Navy.</p>
      <p>
        We also considered six different languages (including morphologically rich and
low-resourced ones), namely German (de), Finnish (fi), Romanian (ro), Russian (ru),
Croatian (hr), and Xhosa (xh), containing a diverse combination of isolating,
fusional, and agglutinative languages with dependent and mixed marking [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. For
each of the above collections, 5K parallel sentences for each language pair were
randomly sampled to form the dataset.
      </p>
      <p>Task. Given a language pair $(X, Y)$, for each input sentence in $X$ the
corresponding translation in $Y$ is extracted using cosine similarity computation.
For example, given a sentence $x_i$ in language $X$, its similarity is computed to
sentences $y_j$ (for all $j$) of language $Y$, and the one with the maximum cosine
similarity score is reported as the translation of $x_i$ by the algorithm.
Evaluation Measure. We evaluate the accuracy of the competing algorithms
using the Precision-at-Rank-1 (P@1) and Precision-at-Rank-5 (P@5) measures.
These report the percentage of times the ground-truth translation of an
input sentence (across language pairs) is present in the extracted top-1 and
top-5 results, respectively. Note that equivalent sentence translations should represent
the highest semantic similarity, and should be reported at top-1.
</p>
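      <p>
        A minimal sketch of this retrieval-and-scoring protocol (our illustration; the embedding matrices stand in for any of the baselines above) follows:
      </p>
      <preformat>
# Evaluation sketch: rank Y-language sentences by cosine similarity to each
# X-language sentence, then compute Precision-at-Rank-k (P@1, P@5).
import numpy as np

def precision_at_k(emb_x: np.ndarray, emb_y: np.ndarray, k: int) -> float:
    # Cosine similarity matrix between all X and Y sentence embeddings.
    x = emb_x / np.linalg.norm(emb_x, axis=1, keepdims=True)
    y = emb_y / np.linalg.norm(emb_y, axis=1, keepdims=True)
    sims = x @ y.T
    # Indices of the top-k most similar Y sentences for each X sentence.
    topk = np.argsort(-sims, axis=1)[:, :k]
    # Ground truth: sentence i of X translates to sentence i of Y.
    hits = (topk == np.arange(len(emb_x))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy stand-in embeddings (5K parallel sentences in the actual setup).
rng = np.random.default_rng(0)
emb_x = rng.normal(size=(100, 300))
emb_y = emb_x + 0.1 * rng.normal(size=(100, 300))  # noisy "translations"

print(f"P@1 = {precision_at_k(emb_x, emb_y, 1):.2%}")
print(f"P@5 = {precision_at_k(emb_x, emb_y, 5):.2%}")
      </preformat>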
    </sec>
    <sec id="sec-4">
      <title>Experimental Observations</title>
      <p>Table 1 tabulates the multi-lingual sentence semantic similarity accuracy scores
as obtained by the different approaches across the languages. We observe that
for high-resource languages like German, SBERT performs the best, showcasing
robustness to domain-specificity, while WMD-A provides comparable results.
However, on morphologically rich and low-resource languages like Finnish and
Romanian, WMD-A is seen to outperform the other methods in almost all
the domains. Intuitively, the presence of sufficient training resources for
well-documented languages provides enhanced performance for supervised methods
like SBERT. But for morphologically rich and low-resourced languages, training
of language models is challenged by linguistic diversity and scarcity of resources.</p>
      <p>To further explore the robustness of the algorithms, Table 2(a) considers a
more challenging setting of domain-specificity coupled with distant languages.
We observe that in this case, unsupervised WMD-A based on aligned vector
embeddings consistently outperforms the other techniques in nearly all the
scenarios. As expected, WMD on the non-aligned embedding space (WMD-NA)
performs the worst, as optimal transport is not theoretically geared for
comparison across different domains. The use of the Gromov-Wasserstein distance
brings a healthy improvement over plain WMD, but falls significantly short of
the other algorithms, possibly due to the presence of limited context. It is also
interesting to note that M-BERT fails to perform well, possibly due to its
lack of robustness to domain-specificity when generating embeddings.</p>
      <p>The above observations apply when the language pairs are coupled
with English; however, in certain scenarios, semantic similarity in texts across
other language pairs might be necessary. For completeness, we compare the
performance of the methodologies with different language-pair bases, as shown
in Table 2(b). We observe that in these scenarios, SBERT performs the best
while WMD-A showcases comparable results. As most cross-lingual vocabulary
alignment techniques consider English as the base shared embedding space, the
dependency of WMD-A on English is portrayed in this analysis. In general, we
find that WMD-A provides an effective method for textual semantic similarity
across diverse domains and languages.
</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we compare and analyze the performance of state-of-the-art
approaches for text similarity in the face of domain-specificity and diverse linguistic
variations. We observed that Word Mover's Distance based on aligned vector
space embeddings provides an efficient and unsupervised technique for
computing cross-lingual textual similarity. It is robust to domain-specific data even on
morphologically rich and distant languages, and might be easily applicable to
evolving event-centric information (without the need for any training process).
Overall, WMD-A provides an effective method for textual semantic similarity,
comparable to state-of-the-art advanced machine learning methods and language
models, depicting that "alignment is all you need".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbet</surname>
          </string-name>
          , C., et al.:
          <article-title>Churn Intent Detection in Multilingual Chatbot Conversations and Social Media</article-title>
          . In: CoNLL. pp.
          <fpage>161</fpage>–<lpage>170</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alvarez-Melis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaakkola</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Gromov-Wasserstein Alignment of Word Embedding Spaces</article-title>
          . In: EMNLP. pp.
          <fpage>1881</fpage>–<lpage>1890</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Artetxe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labaka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
          </string-name>
          , E.:
          <article-title>A Robust Self-learning Method for Fully Unsupervised Cross-lingual Mappings of Word Embeddings</article-title>
          . In: ACL. pp.
          <fpage>789</fpage>–<lpage>798</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Balikas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laclau</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redko</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Cross-Lingual Document Retrieval Using Regularized Wasserstein Distance</article-title>
          . In: ECIR. pp.
          <fpage>398</fpage>–<lpage>410</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          ,
          <fpage>135</fpage>–<lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , T.B., et al.:
          <article-title>Language Models are Few-Shot Learners</article-title>
          (
          <year>2020</year>
          ), arXiv:2005.14165
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cer</surname>
          </string-name>
          , D., et al.:
          <source>Universal Sentence Encoder</source>
          (
          <year>2018</year>
          ), arXiv preprint arXiv:1803.11175
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Word Translation Without Parallel Data</article-title>
          .
          <source>In: ICLR</source>
          . pp.
          <fpage>1</fpage>–<lpage>14</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In: NAACL-HLT</source>
          . pp.
          <fpage>4171</fpage>–<lpage>4186</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>DeBERTa: Decoding-enhanced BERT with Disentangled Attention</article-title>
          . In: ICLR. pp.
          <fpage>1</fpage>–<lpage>20</lpage>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
          </string-name>
          , E.:
          <article-title>Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion</article-title>
          . In: EMNLP. pp.
          <fpage>2979</fpage>–<lpage>2984</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kusner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolkin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>From Word Embeddings To Document Distances</article-title>
          . In: ICML. pp.
          <fpage>957</fpage>–<lpage>966</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cross-lingual Language Model Pretraining</article-title>
          (
          <year>2019</year>
          ), arXiv:1901.07291
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>MacAvaney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soldaini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goharian</surname>
          </string-name>
          , N.:
          <article-title>Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-Shot Learning</article-title>
          .
          <source>In: ECIR</source>
          . pp.
          <fpage>246</fpage>–<lpage>254</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mass</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmeli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roitman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konopnicki</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Unsupervised FAQ Retrieval with Question Generation and BERT</article-title>
          . In: ACL. pp.
          <fpage>807</fpage>–<lpage>812</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mémoli</surname>
          </string-name>
          , F.:
          <article-title>Gromov-Wasserstein Distances and the Metric Approach to Object Matching</article-title>
          .
          <source>Foundations of Computational Mathematics</source>
          <volume>11</volume>
          ,
          <fpage>417</fpage>–<lpage>487</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Exploiting Similarities among Languages for Machine Translation</article-title>
          (
          <year>2013</year>
          ), arXiv preprint arXiv:1309.4168
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>In: NIPS</source>
          . pp.
          <fpage>3111</fpage>–<lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beckwith</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          :
          <article-title>Introduction to WordNet: An On-line Lexical Database</article-title>
          .
          <source>International Journal of Lexicography</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <fpage>235</fpage>–<lpage>244</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Munteanu</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcu</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>Improving Machine Translation Performance by Exploiting Non-parallel Corpora</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>31</volume>
          (
          <issue>4</issue>
          ),
          <fpage>477</fpage>–<lpage>504</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Raffel, C., et al.:
          <article-title>Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <issue>140</issue>
          ),
          <fpage>1</fpage>–<lpage>67</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Zhang, J.,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ Questions for Machine Comprehension of Text</article-title>
          . In: EMNLP. pp.
          <fpage>2383</fpage>–<lpage>2392</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Rapp</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Automatic Identification of Word Translations from Unrelated English and German Corpora</article-title>
          . In: ACL. pp.
          <fpage>519</fpage>–<lpage>526</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          .
          <source>In: EMNLP</source>
          . pp.
          <fpage>3982</fpage>–<lpage>3992</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. Ruckle,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Eger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Peyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          :
          <article-title>Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations</article-title>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Rupnik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muhic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leban</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fortuna</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>News Across Languages - Cross-Lingual Document Similarity and Event Tracking</article-title>
          . In: IJCAI. pp.
          <fpage>5050</fpage>–<lpage>5054</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Santosh</surname>
            ,
            <given-names>G.S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>N.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Ranking Multilingual Documents Using Minimal Language Dependent Resources</article-title>
          . In: CICLing. pp.
          <fpage>212</fpage>–<lpage>220</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. Schonemann, P.H.:
          <article-title>A Gen. Sol. of the Ort</article-title>
          .
          <source>Procrustes Prob. Psychometrika</source>
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <volume>1</volume>
          {
          <fpage>10</fpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. Søgaard, A.,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vulic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>On the Limitations of Unsupervised Bilingual Dictionary Induction</article-title>
          . In: ACL. pp.
          <fpage>778</fpage>–<lpage>788</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Steinberger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pouliquen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagman</surname>
          </string-name>
          , J.:
          <article-title>Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC</article-title>
          . In: CICLing. pp.
          <fpage>415</fpage>–<lpage>424</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Xian</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akata</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Zero-shot Learning - the Good, the Bad and the Ugly</article-title>
          . In: CVPR. pp.
          <fpage>4582</fpage>–<lpage>4591</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Izuha</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
          </string-name>
          , J.:
          <article-title>Building Earth Mover's Distance on Bilingual Word Embeddings for Machine Translation</article-title>
          . In: AAAI. pp.
          <fpage>2870</fpage>–<lpage>2876</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>