<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Mathematical Information Retrieval to Perform Translations up to Computer Algebra Systems</article-title>
      </title-group>
      <contrib-group>
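        <contrib contrib-type="author">
          <name>
            <surname>Greiner-Petter</surname>
            <given-names>Andre</given-names>
          </name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Science Group, University of Konstanz</institution>
          ,
          <country country="DE">Germany</country>
        </aff>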
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Research Objectives and Plans
In mathematics, LaTeX is the de facto standard for preparing documents, e.g., scientific publications. While some formulae are still developed using pen and paper, more complicated mathematical expressions are increasingly handled with computer algebra systems, and they are often transcribed into those systems manually. The goal of my doctoral thesis is to improve the efficiency of this workflow. My envisioned method will automatically enrich mathematical expressions with semantics so that they can be imported into computer algebra systems and other systems that can take advantage of the semantics, such as search engines or automatic plagiarism detection systems. These imports should preserve the essential semantic features of the expression. The translation process between semantic expressions and computer algebra systems was realized in my Master's thesis, and the results were published in [CSY+17]. Therefore, my doctoral thesis will focus on the semantic enrichment of generic LaTeX expressions.</p>
        <p>To achieve this goal, I propose a multiple-scan approach with three parts: (1) narrow down possible meanings from the expression itself, without referring to its context; (2) refine the result with conclusions from the nearby context of the expression; and (3) improve the result further by analyzing not only the nearby context but also the overall topic of the whole scientific paper or book, its references and other publications by the authors.</p>
        <p>Objective (1) concentrates on the expression itself, without extracting information from the context. My proposed approach is to exploit the coherence between the structure of a given formula and its meaning, constructing a Markov logic network to deduce possible semantic meanings, each of which is assigned a probability. If the highest probability is below a given threshold, objectives (2) and (3) are needed to improve the probabilities. Otherwise, the probability is sufficiently high to conclude the semantic information. For example, consider the Jacobi polynomial <inline-formula><tex-math>P_n^{(\alpha,\beta)}(\cos(a\Theta))</tex-math></inline-formula>.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The given expression has a superscript, a subscript and a following expression in parentheses. A leading expression
in letters followed by an expression in parentheses suggests that the leading expression
is the name of a function and the expression in parentheses is its argument. Additionally, the first symbol P
has a superscript and a subscript. However, the Meixner-Pollaczek polynomial <inline-formula><tex-math>P_n^{(\lambda)}(x;\phi)</tex-math></inline-formula> and the associated
Legendre function of the first kind <inline-formula><tex-math>P_\nu^\mu(x)</tex-math></inline-formula> are also denoted by P, and all of these functions have a superscript,
a subscript and an argument. The Jacobi polynomial, however, takes two parameters in its superscript, while the
Meixner-Pollaczek polynomial and the Legendre function take just one parameter in the superscript.
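</p>
      <p>The structural reasoning above can be sketched in a few lines of Python. This is only a minimal illustration and not the Markov logic network proposed for objective (1): it replaces the network with a plain weighted feature match. The candidate table, the mismatch weights and the threshold of 0.8 are assumptions chosen for the example, and the names Candidate, rank_candidates and needs_context are hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    superscript_params: int  # number of parameters expected in the superscript
    argument_params: int     # number of parameters expected in the parentheses


# Hypothetical candidate table for the symbol P (assumption for illustration).
CANDIDATES = [
    Candidate("Jacobi polynomial", 2, 1),
    Candidate("Meixner-Pollaczek polynomial", 1, 2),
    Candidate("associated Legendre function of the first kind", 1, 1),
]


def rank_candidates(superscript_params, argument_params):
    """Return (name, probability) pairs, most likely candidate first."""
    weights = []
    for cand in CANDIDATES:
        weight = 1.0
        # Assumed penalties: each structural mismatch shrinks the weight.
        if cand.superscript_params != superscript_params:
            weight *= 0.05
        if cand.argument_params != argument_params:
            weight *= 0.2
        weights.append(weight)
    total = sum(weights)
    ranked = sorted(zip(CANDIDATES, weights), key=lambda pair: pair[1], reverse=True)
    return [(cand.name, weight / total) for cand, weight in ranked]


def needs_context(ranking, threshold=0.8):
    """True if the best candidate is not confident enough on its own."""
    return not (ranking[0][1] >= threshold)
```
</preformat>
      <p>Under these assumed weights, calling rank_candidates(2, 1) for an expression with two superscript parameters and one argument ranks the Jacobi polynomial first. Any ranking whose best probability stays below the threshold would be passed on to objectives (2) and (3).</p>
      <p>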
Assume we cannot conclude a unique mathematical object from the expression itself. In those cases, we will
investigate the context of the input, as described in objective (2). A large-scale corpus study showed
that around 70 percent of the symbolic elements in scientific papers are denoted in the text. Therefore, the idea
is to identify the symbols in a formula (called identifiers) and, in the surrounding text, the key words that
describe them (called definiens). Once we have extracted possible identifier-definiens pairs, we can score each pair
to determine the most likely one. This approach was published by M. Schubotz et al. [SGL+16] in 2016.
The scoring process assumes that the chance of a correct combination of identifier and definiens depends on the
distance between the identifier and its definiens and on the distance of the identifier to the closest formula that contains
this identifier. I strongly believe we can improve the scoring process with the conclusions from my first objective
above.</p>
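      <p>The distance-based scoring of identifier-definiens pairs can be sketched as follows. This is a simplified sketch and not the scoring formula published in [SGL+16]: the exponential decay, the scale constants and the optional structural_prior field (standing in for evidence from objective (1)) are assumptions made for illustration, and all names are hypothetical.</p>
      <preformat>
```python
import math
from typing import NamedTuple


class Pair(NamedTuple):
    identifier: str
    definiens: str
    token_distance: int       # tokens between identifier and candidate definiens
    sentence_distance: int    # sentences to the closest formula containing the identifier
    structural_prior: float = 1.0  # assumed weight contributed by objective (1)


def score(pair, token_scale=5.0, sentence_scale=2.0):
    """Illustrative score: decays as either distance grows."""
    token_term = math.exp(-pair.token_distance / token_scale)
    sentence_term = math.exp(-pair.sentence_distance / sentence_scale)
    return pair.structural_prior * token_term * sentence_term


def best_definiens(pairs):
    """Return the highest-scoring definiens for each identifier."""
    best = {}
    for pair in pairs:
        value = score(pair)
        if pair.identifier not in best or value > best[pair.identifier][1]:
            best[pair.identifier] = (pair.definiens, value)
    return {ident: definiens for ident, (definiens, _) in best.items()}
```
</preformat>
      <p>For example, given the candidates Pair("n", "degree", 1, 0) and Pair("n", "polynomial", 6, 0), the closer key word "degree" receives the higher score and is selected for the identifier n.</p>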
      <p>If the correct semantic information is still uncertain, objective (3) is the last resort for finding a solution. Online
compendia, such as arXiv, can be used to discover the overall topic of a scientific paper, its references and the
authors' areas of research. Engines already exist that try to find dependencies between publications by
examining citations, titles and abstracts. I plan to incorporate such existing approaches to address objective
(3).</p>
      <p>Completed &amp; Remaining Research
Since objective (1) is part of the Part-of-Math (POM) tagger [You17], I focus on objectives (2) and (3) and
support the development of the POM tagger in collaboration with the DRMF project team. In my first contribution
towards (2), I analyzed the capabilities of existing tools to convert plain LaTeX to content
MathML [SGPS+18]. The gold standard developed in that work is used to measure the accuracy and identify the weaknesses of
state-of-the-art tools. First experiments have shown that we were able to increase the accuracy significantly by
adding semantic LaTeX macros based on the results of the context analysis process of [SGL+16].
The next project aims to solve the problem from a different perspective. Wikipedia, a very frequently used
encyclopedia, receives over 17 million edits every month. During the last two years, 7 million different formulae have
been edited. Wikipedia has used TeX markup for mathematical expressions since 2003. We consider the Wikipedia
editing interface a highly suitable test environment for a recommendation system that enhances mathematical
LaTeX input with semantics. The idea is to recommend replacing the plain LaTeX input with a
semantic version using macros. Recommendations will be generated by a machine learning (ML) algorithm trained
on the defined backward translation, i.e., from semantic macros to plain LaTeX. The algorithm will further learn
in a supervised manner from the selections made by editors. Once implemented, such a system would gradually increase the semantic information
of mathematics in Wikipedia and improve the ML algorithm. Subsequently, the algorithm can be used for
automatic translations.</p>
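      <p>One way to obtain training data from the backward translation is to expand semantic macros into plain LaTeX and keep both sides as an input-target pair. The sketch below illustrates this under stated assumptions: the macro syntax is modeled on DLMF/DRMF-style semantic macros, the two rules are an illustrative rule table rather than the project's actual translation, and the function names are hypothetical.</p>
      <preformat>
```python
import re


def jacobi_to_plain(match):
    # Groups: alpha, beta, degree n, argument x.
    alpha, beta, n, x = match.groups()
    return "P_{" + n + "}^{(" + alpha + "," + beta + ")}(" + x + ")"


def gamma_to_plain(match):
    return "\\Gamma(" + match.group(1) + ")"


# Assumed backward-translation rules: semantic macro pattern -> plain LaTeX.
# The patterns only handle arguments without nested braces.
BACKWARD_RULES = [
    (re.compile(r"\\JacobiP\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}@\{([^{}]*)\}"),
     jacobi_to_plain),
    (re.compile(r"\\EulerGamma@\{([^{}]*)\}"), gamma_to_plain),
]


def to_plain(semantic):
    """Expand every known semantic macro into plain LaTeX."""
    plain = semantic
    for pattern, replacement in BACKWARD_RULES:
        plain = pattern.sub(replacement, plain)
    return plain


def training_pairs(semantic_expressions):
    """Build (plain input, semantic target) pairs for a recommender."""
    return [(to_plain(expr), expr) for expr in semantic_expressions]
```
</preformat>
      <p>A recommender trained on such pairs would see the plain form as input and the semantic form as the suggestion to offer. Selections made by editors could then be appended to the same list of pairs as additional supervised examples.</p>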
      <p>Summary
I am still at the beginning of my doctoral research, and the described approaches are ambitious. However, our
first contributions have shown valuable results, and the developed gold standard provides a foundation
for evaluating our upcoming projects.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [CSY+17]
          <string-name><given-names>Howard S.</given-names> <surname>Cohl</surname></string-name>,
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>,
          <string-name><given-names>Abdou</given-names> <surname>Youssef</surname></string-name>,
          <string-name><given-names>Andre</given-names> <surname>Greiner-Petter</surname></string-name>,
          <string-name><given-names>Jürgen</given-names> <surname>Gerhard</surname></string-name>,
          <string-name><given-names>Bonita V.</given-names> <surname>Saunders</surname></string-name>,
          <string-name><given-names>Marjorie A.</given-names> <surname>McClain</surname></string-name>,
          <string-name><given-names>Joon</given-names> <surname>Bang</surname></string-name>, and
          <string-name><given-names>Kevin</given-names> <surname>Chen</surname></string-name>.
          <article-title>Semantic preserving bijective mappings of mathematical formulae between document preparation systems and computer algebra systems</article-title>.
          In <source>Lecture Notes in Computer Science</source>, pages
          <fpage>115</fpage>–<lpage>131</lpage>. Springer International Publishing,
          <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
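        <mixed-citation>
          [SGL+16]
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>,
          <string-name><given-names>Alexey</given-names> <surname>Grigorev</surname></string-name>,
          <string-name><given-names>Marcus</given-names> <surname>Leich</surname></string-name>,
          <string-name><given-names>Howard S.</given-names> <surname>Cohl</surname></string-name>,
          <string-name><given-names>Norman</given-names> <surname>Meuschke</surname></string-name>,
          <string-name><given-names>Bela</given-names> <surname>Gipp</surname></string-name>,
          <string-name><given-names>Abdou S.</given-names> <surname>Youssef</surname></string-name>, and
          <string-name><given-names>Volker</given-names> <surname>Markl</surname></string-name>.
          <article-title>Semantification of identifiers in mathematics for better math information retrieval</article-title>.
          In <source>Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16</source>, pages
          <fpage>135</fpage>–<lpage>144</lpage>, New York, NY, USA,
          <year>2016</year>. ACM.
        </mixed-citation>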
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [SGPS+18]
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>,
          <string-name><given-names>Andre</given-names> <surname>Greiner-Petter</surname></string-name>,
          <string-name><given-names>Philipp</given-names> <surname>Scharpf</surname></string-name>,
          <string-name><given-names>Norman</given-names> <surname>Meuschke</surname></string-name>,
          <string-name><given-names>Howard</given-names> <surname>Cohl</surname></string-name>, and
          <string-name><given-names>Bela</given-names> <surname>Gipp</surname></string-name>.
          <article-title>Improving the representation and conversion of mathematical formulae by considering their textual context</article-title>.
          In <source>Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)</source>, Fort Worth, USA, Jun.
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [You17]
          <string-name><given-names>Abdou</given-names> <surname>Youssef</surname></string-name>.
          <article-title>Part-of-math tagging and applications</article-title>.
          In <source>Lecture Notes in Computer Science</source>, pages
          <fpage>356</fpage>–<lpage>374</lpage>. Springer International Publishing,
          <year>2017</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>