<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TIAD Shared Task 2019: Orthonormal Explicit Topic Analysis for Translation Inference across Dictionaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John P. McCrae</string-name>
          <email>john@mccr.ae</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Insight Centre for Data Analytics/Data Science Institute, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>The task of inferring translations can be achieved by means of comparable corpora, and in this paper we apply explicit topic modelling over comparable corpora to the task of inferring translation candidates. In particular, we use the Orthonormal Explicit Topic Analysis (ONETA) model, which has been shown to be the state-of-the-art explicit topic model through its elimination of correlations between topics. The method proves highly effective at selecting translations with high precision.</p>
      </abstract>
      <kwd-group>
<kwd>Topic Modelling</kwd>
        <kwd>Explicit Topics</kwd>
        <kwd>Translation Inference</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Explicit topic modelling, as proposed by the Explicit Semantic Analysis
(ESA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
] method, is a method that, in contrast to latent topic modelling such
as Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
] or word embeddings, relies on the user
to explicitly provide a list of topics. These topics are a set of documents that are
supposed to correspond to the major topical areas of the domain; in
most works, including this one, a set of Wikipedia articles is chosen as the
explicit topics. While this method obviously requires more manual effort than
latent methods, it does provide a number of advantages, most notably that the
topics can easily be aligned across languages, as has been implemented
by Cross-lingual Explicit Semantic Analysis (CL-ESA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In contrast, latent
methods require a complex and error-prone step of aligning the latent topics
across languages [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. One of the principal criticisms of explicit semantic analysis
is that the choice of underlying implementation can strongly affect the quality
of the resulting system [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. One of the main reasons for this is that
the topics chosen for the explicit analysis are often highly similar, which
causes a lack of orthogonality between the topics [
        <xref ref-type="bibr" rid="ref1 ref4">4, 1</xref>
        ]. For this reason we use
the Orthonormal Explicit Topic Analysis (ONETA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
] method in order to find
cross-lingual equivalents between terms.
      </p>
      <p>
Translation inference is the task of inferring a translation equivalent between
two languages by means of existing bilingual dictionaries for other language pairs.
The principal issue is that the translation graph is not transitive: by
following a translation pair from English to Spanish, and then a translation pair
from Spanish to French, an incorrect translation may be inferred if there are
multiple senses of the Spanish word that is used as a pivot. However, previous
TIAD tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
] have shown that this is a moderately high-precision method. For
this edition of the task, we proposed filtering the results of pivot translations
by means of inferred cross-lingual similarity using ONETA, with the idea that
translations that are both found by the pivot and ranked as highly similar by
the ONETA method are likely to be high-quality translations. In this way, we
provide a method that allows the lexicographer to easily adjust the method to
the level of precision that is most suitable for validating translation candidates
generated by pivot-based translation.
      </p>
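      <p>The pivot idea above can be made concrete with a toy sketch (the dictionaries and words below are illustrative only, not from the task data):

```python
# Toy sketch of pivot-based translation inference; the dictionaries and words
# here are illustrative, not from the task data.
en_es = {"bank": ["banco", "orilla"]}                  # two senses of "bank"
es_fr = {"banco": ["banque", "banc"], "orilla": ["rive"]}

def pivot_candidates(src_word, d1, d2):
    """All target words reachable from src_word through any pivot translation."""
    out = set()
    for pivot in d1.get(src_word, []):
        out.update(d2.get(pivot, []))
    return out

# "banc" (a bench) is reached through the polysemous pivot "banco": exactly
# the kind of spurious candidate that a similarity filter should remove.
print(sorted(pivot_candidates("bank", en_es, es_fr)))
```

The filtering proposed in this paper scores each such candidate pair with a cross-lingual similarity and keeps only the high-scoring ones.
      </p>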
    </sec>
    <sec id="sec-2">
      <title>Orthonormal Explicit Topic Analysis</title>
      <p>Orthonormal explicit topic analysis follows from explicit semantic analysis by assuming there is a background collection of documents B = {b_1, ..., b_n}, and in the cross-lingual setting it is assumed that there is a paired set of documents B' = {b'_1, ..., b'_n}, with each document being paired with a similar document in a second language. This is most frequently achieved by using Wikipedia, where interlingual links connect two articles in different languages. It is assumed that we have some language-specific function φ that maps a document to a vector in R^n, such that the j-th element of the vector, φ_j(d), is an association of the document d with the document b_j. In our method, this score is given by a metric such as TF-IDF:

φ_j(d) = tf-idf(b_j)^T tf-idf(d)    (1)</p>
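      <p>A minimal numpy sketch of Equation 1 (the toy counts and the exact TF-IDF variant are our assumptions, not the paper's weighting):

```python
import numpy as np

# Toy term-document count matrix: rows are vocabulary terms, columns are the
# background documents b_1..b_n (Wikipedia articles in the paper).
counts = np.array([[2., 0., 1.],
                   [0., 3., 0.],
                   [1., 1., 4.]])

# A simple tf-idf weighting; the exact variant is an implementation choice.
df = (counts > 0).sum(axis=1)                  # document frequency per term
idf = np.log(counts.shape[1] / df)
X = counts * idf[:, None]                      # x_wj = tf-idf_w(b_j)

def phi(d_tfidf, X):
    """Equation 1: phi_j(d) = tf-idf(b_j)^T tf-idf(d), for every b_j at once."""
    return X.T @ d_tfidf

d = X[:, 0]                                    # query with b_1 itself
print(phi(d, X))                               # association with each topic
```

Computing the whole vector φ(d) at once as X^T d is the form used in the derivations below.
      </p>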
      <p>If we consider the application of this method to the background corpus, we can construct a matrix X whose elements are the corresponding TF-IDF values, x_wj = tf-idf_w(b_j), and hence φ_i(b_j) is the (i, j)-th element of X^T X. One of the key assumptions is that the topics should be as distinct as possible, in order to reduce the amount of overlap between the topics. This is achieved by assuming that we have some function sim : B × B → [0, 1] that has the following property:

sim(b_i, b_j) = 1 if i = j; 0 if i ≠ j</p>
      <p>
This can be thought of as maximizing training accuracy, as we are ensuring
that the similarity of two different topics in our background is zero and the
similarity of a topic with itself is one. In McCrae et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
] it was shown that
this can be achieved by the following function¹:
      </p>
      <p>ONETA(d) = (X^T X)^{-1} φ(d)    (2)
         = X^+ tf-idf(d)    (3)

¹ X^+ denotes the Moore-Penrose pseudo-inverse, which satisfies X^+ X = I.</p>
      <p>For any choice of φ(d) where (X^T X)^{-1} exists, it is easy to verify that Equation 2 holds, as:</p>
      <p>I = (X^T X)^{-1} X^T X = X^+ X</p>
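      <p>The pseudo-inverse form of Equations 2 and 3 can be sketched directly with numpy (toy matrix values are illustrative):

```python
import numpy as np

# Sketch of Equations 2-3 on a toy tf-idf matrix X:
# ONETA(d) = (X^T X)^{-1} phi(d) = X^+ tf-idf(d).
X = np.array([[1.0, 0.4, 0.0],
              [0.2, 1.0, 0.3],
              [0.0, 0.5, 1.0],
              [0.3, 0.0, 0.2]])      # 4 terms x 3 background documents

Xp = np.linalg.pinv(X)               # Moore-Penrose pseudo-inverse: X+ X = I

def oneta(d_tfidf):
    return Xp @ d_tfidf              # ONETA(d) = X+ tf-idf(d)

# Each background document maps onto its own unit basis vector, so the topic
# representations are exactly orthonormal over the background corpus.
print(np.round(oneta(X[:, 0]), 6))   # ~[1, 0, 0]
```

This makes the "training accuracy" property concrete: sim(b_i, b_j) is 1 exactly when i = j and 0 otherwise.
      </p>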
      <p>In practice the computation of this matrix can be time-consuming, so instead McCrae et al. proposed rearranging the order of the vocabulary in the background collection to find a good approximation of the form:

X ≈ [ A  B ]
    [ 0  C ]    (4)</p>
      <p>It is easy to verify the following equation based on a matrix of this form²:

I = [ A^+  -A^+ B C^+ ] [ A  B ]
    [ 0    C^+        ] [ 0  C ]    (5)</p>
      <p>This leads to a strong approximation of ONETA as follows:

ONETA'(d) = [ A^+  -A^+ B C^+ ]
            [ 0    C^+        ] tf-idf(d)    (6)</p>
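      <p>The block form X ≈ [[A, B], [0, C]] and its left inverse can be checked numerically on random matrices (sizes here are arbitrary):

```python
import numpy as np

# Check of the block approximation: with X = [[A, B], [0, C]], the block
# matrix [[A+, -A+ B C+], [0, C+]] is a left inverse of X whenever
# A+ A = I and C+ C = I (i.e. A and C have full column rank).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))              # tall, full column rank -> A+ A = I
B = rng.normal(size=(5, 2))
C = rng.normal(size=(4, 2))              # tall, full column rank -> C+ C = I

Ap, Cp = np.linalg.pinv(A), np.linalg.pinv(C)
X = np.block([[A, B], [np.zeros((4, 3)), C]])
left = np.block([[Ap, -Ap @ B @ Cp], [np.zeros((2, 5)), Cp]])

print(np.allclose(left @ X, np.eye(5)))  # True
```

The saving comes from inverting the small blocks A and C instead of the full matrix X.
      </p>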
    </sec>
    <sec id="sec-3">
      <title>Applying ONETA to dictionary inference</title>
      <p>[Table 1. Sizes of the comparable corpora, with columns Language 1, Language 2 and Articles; the language pairs are English–Spanish, English–French, English–Portuguese and French–Portuguese.]</p>
      <sec id="sec-3-3">
        <p>The key purpose of ONETA is to estimate the similarity between documents; to apply it to the task of inferring translations, we make the simple assumption that each term in a translation is a single document consisting of only the term in question. As such, we simply apply the system by building two ONETA functions for our source language s and target language t, and estimate the similarity as:

sim(w_s, w_t) = cos(ONETA_s(w_s), ONETA_t(w_t))    (8)

² McCrae et al. use the Jacobi preconditioner of C as a further approximation of C^+.</p>
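        <p>A minimal sketch of Equation 8 (toy matrices; in the paper the columns are interlingually linked Wikipedia articles):

```python
import numpy as np

# Sketch of Equation 8: score a candidate translation pair by the cosine of
# its two ONETA vectors. Xs and Xt are toy tf-idf matrices whose columns are
# aligned background documents in the two languages.
rng = np.random.default_rng(1)
Xs = rng.random((6, 3))              # source-language vocabulary x topics
Xt = rng.random((5, 3))              # target-language vocabulary x topics
Xs_p, Xt_p = np.linalg.pinv(Xs), np.linalg.pinv(Xt)

def sim(ws_tfidf, wt_tfidf):
    a = Xs_p @ ws_tfidf              # ONETA_s(w_s)
    b = Xt_p @ wt_tfidf              # ONETA_t(w_t)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A term is treated as a one-word document, so its tf-idf vector is the
# input; a pair of aligned background documents scores ~1.0.
print(round(sim(Xs[:, 0], Xt[:, 0]), 3))
```

Because the two topic spaces are indexed by the same aligned documents, the cosine is computed in a shared coordinate system.
        </p>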
        <p>In order to construct pairs, we considered only a simple pivot through one language; as the task languages were English, French and Portuguese, there were only two common pivot languages between them, namely Spanish and Catalan, and as such we called our two systems ONETA-ES and ONETA-CA based on the pivot language. We simply considered all possible translations between the two language pairs and then calculated the similarity using the ONETA score. As we found that the distribution of the scores was strongly clustered around zero, we used the following function to provide a more even spread of scores.</p>
        <p>sim'(w_s, w_t) = |sim(w_s, w_t)|^α    (9)</p>
        <p>For our experiments we tuned α = 0.3 to provide a reasonable spread of certainty values. As with previous work, we used Wikipedia to construct our corpora, using the interlingual index to create a comparable corpus for each language pair, the sizes of which are given in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <p>Table 2. Precision, recall and F1 at each similarity threshold:

Threshold  Precision  Recall  F1
0.0        0.845      0.541   0.659
0.1        0.904      0.237   0.375
0.2        0.902      0.184   0.306
0.3        0.894      0.149   0.255
0.4        0.885      0.119   0.209
0.5        0.884      0.093   0.168
0.6        0.878      0.072   0.133
0.7        0.869      0.053   0.101
0.8        0.866      0.038   0.072
0.9        0.867      0.022   0.043</p>
        <p>During development we evaluated on the English to Spanish translations using Catalan as a pivot, as all language pairs are available as part of the training data; the results are presented in Table 2. It should be noted that at the threshold value of 0.0 the system is essentially nothing more than pivot translation, and this should be considered a baseline. For higher values of the threshold, ONETA does improve the precision; however, the recall also decreases rapidly, causing the F-measure to fall overall.</p>
        <p>In the official results (Table 3), we see a similar outcome, where the highest F-measure is achieved at the trivial threshold of 0.0, and we see strong gains in precision at the cost of recall. This shows that ONETA can quite effectively select translations that are very likely to be correct, but misses many translations even among those that are generated by a pivot method.</p>
        <p>When all systems are compared (Figure 1) at various threshold levels, we see that the ONETA-ES system actually reports the strongest F1 measure (averaged over all language pairs) of any system, although it should be noted that this is at a threshold value that we would consider to be a baseline. Even so, at the threshold of 0.1, ONETA still has the second- and eighth-best results; moreover, we achieved the strongest precision scores across all languages (except for results with a recall that was reported as zero).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have presented the ONETA system and its application to translation inference; it was the only system to produce a value that beat the baseline, albeit in a mode where it is effectively a baseline itself. The system does show a noticeable ability to trade off precision and recall, and as such it would likely be effective in settings where precision is more important than recall, for example in a semi-automated setting where showing annotators too many poor-quality translations would waste time. There are two principal flaws with the implementation as it stands: firstly, the recall is limited, and even in our baseline mode we only achieved a recall of about 20–30%, which needs to be overcome by finding more translations than are just present in the graph. Secondly, the system is not aware of senses, and the selection of multiple document collections, likely to show many different senses of a word, may help the system to distinguish between translation pairs which do not rely on the most frequent senses of words.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, co-funded by the European Regional Development Fund, and the European Union's Horizon 2020 research and innovation programme under grant agreement No 731015, ELEXIS – European Lexical Infrastructure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asooja</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordea</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Non-orthogonal explicit semantic analysis</article-title>
          .
          <source>In: Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics</source>
          . pp.
          <volume>92</volume>
–
          <issue>100</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
<source>Journal of Machine Learning Research 3(Jan)</source>
          ,
          <volume>993</volume>
–
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In: IJCAI</source>
          . vol.
          <volume>7</volume>
          , pp.
          <volume>1606</volume>
–
          <issue>1611</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Orthonormal explicit topic analysis for cross-lingual document matching</article-title>
          .
          <source>In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <volume>1732</volume>
–
          <issue>1742</issue>
          (
          <year>2013</year>
          ), https://www.aclweb.org/anthology/D/D13/D13-1179.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ordan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gracia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kernerman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <source>Proceedings of TIAD-2017 Shared Task – Translation Inference Across Dictionaries</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sorg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An experimental comparison of explicit semantic analysis implementations for cross-language retrieval</article-title>
          .
          <source>In: International Conference on Application of Natural Language to Information Systems</source>
          . pp.
          <volume>36</volume>
–
          <fpage>48</fpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sorg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Exploiting Wikipedia for cross-lingual and multilingual information retrieval</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          <volume>74</volume>
          , 26–
          <fpage>45</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vulic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings</article-title>
          .
          <source>In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval</source>
          . pp.
          <volume>363</volume>
–
          <fpage>372</fpage>
          .
          ACM
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>