<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Hagenberg, Austria</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Syntactic/Semantic Analysis for High-Precision Math Linguistics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Frederik Schaefer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kohlhase</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>08</lpage>
      <abstract>
        <p>There have been some impressive advancements in the eld of natural language processing, mostly with the use of statistical methods. We believe that statistical methods are inherently limited in their accuracy, restricting their usefulness especially in applications where high accuracy is indispensable. One such application is the translation and formalization of STEM documents, where small errors can render the results unusable. In this paper, we will present our ongoing work on the Mathematical Grammatical Framework (MGF), which is based on the Grammatical Framework (GF). The goal of MGF is to accurately translate and formalize mathematical texts. At this stage, its coverage is restricted to a small set of examples, which can be correctly translated between English and German, and formalized in a logical representation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright c by the paper's authors. Copying permitted for private and academic purposes.
50%</p>
      <sec id="sec-1-1">
        <title>Producer Tasks</title>
        <p>103 1 Concepts</p>
      </sec>
      <sec id="sec-1-2">
        <title>Consumer Tasks</title>
        <p>106 1 Concepts</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Coverage</title>
      <p>int(n) ) (prime(n) , :9m:int(m) ^ less(1; m) ^ less(m; n) ^ divides(m; n)). Note that we interpreted the
adjective \positive" intersectively { more on this in Section 3.3.1.</p>
      <p>Aarne Ranta identi es two classes of tractable areas for natural language processing: consumer tasks and
producer tasks [Ran16] (Figure 1). Consumer tasks require large coverage, usually achieved with statistical
methods, but are inherently limited in their accuracy. A prominent example for this is full-range machine
translation as for example with Google Translate. Mathematical documents, however, fall in the category of
producer tasks : The coverage is restricted to a technical subset of a language with a domain-speci c vocabulary.
Such producer tasks can be tackled with Aarne Ranta's Grammatical Framework (GF) [Ran04; Ran11; GF].</p>
      <p>In this paper, we are going to describe our ongoing e orts to create the Mathematical Grammatical Framework
(MGF). It is based on GF and currently supports the translation and formalization of a small set of mathematical
sentences. After providing an overview of related work and a brief introduction to GF in Section 2, we will
describe our setup in more detail in Section 3. Section 4 provides an evaluation of the current state of our
system, along with a few examples. Section 5 concludes the paper.</p>
      <p>Acknowledgements
We are very grateful to Aarne Ranta, Magdalena Wolska and Deyan Ginev, whose expertise and insights greatly
helped our e orts to create MGF.
2</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work and Preliminaries</title>
      <p>Another project that uses GF for mathematical texts is the \Mathematical Grammar Library" [CS12; SX11;
Arc+12]. It supports the translation of propositions like \the absolute value of x is greater than x or equal
to x " in an impressive number of languages as well as the conversion to LATEX. It is based on the WebALT
project [Cap], which used GF to translate mathematical exercises for students into many languages. Both of
these projects based their grammars on OpenMath's content dictionaries [Bus+04]. We believe that MGF can
achieve wider discourse coverage without this restriction.</p>
      <p>Other research has focused on informal proofs like Claus Zinn's work on the veri cation of informal textbook
proofs [Zin04], in which he uses discourse representation theory [Kam81] for the formal representation of proofs,
and Magdalena Wolska's work on informal proofs by students [Wol13; WKK04]. Some projects avoid challenges
in informal language by introducing a controlled natural language, as for example the System for Automated
Deduction (SAD) with its ForTheL language [VLP07].</p>
      <p>An analysis of natural language in mathematics can be found in [Zin04], [Wol13] and [Bau99]. The structure
(and parsing) of formulae has been investigated in [Gin11].</p>
      <p>The work mentioned so far has mostly focused on the understanding and formalization of informal
language. The other direction has been studied as well. A particularly interesting case is the generation of
humanunderstandable representations of formal proofs [Hor99; Hua89].
2.1</p>
      <sec id="sec-3-1">
        <title>Introduction to GF</title>
        <p>GF (the Grammatical Framework) is \a programming language for multilingual grammar applications" [GF]. GF
Grammars can be divided into two parts: The abstract syntax and the concrete syntax(es). The abstract syntax
describes what meanings can be expressed in the grammar. A small example abstract syntax module is shown in
Listing 1. The concrete syntax (Listing 2) describes how these meanings are expressed in a particular language,
i.e. it provides a mapping to and from strings. GF comes with a resource grammar library that provides basic
(a) Parsing English</p>
        <p>(b) Linearizing into French
syntactic rules for over 35 languages. This way, we can write grammars without dealing with language-speci c
syntax details as in ections, word order, etc.</p>
        <p>Listing 1: Example abstract syntax module.
group : Object;
abelian : Property;
commutative : Property;
addProperty : Property
hasProperty : Object</p>
        <p>&gt; Object
&gt; Property
&gt; Object;
&gt; Statement;
g
g</p>
        <p>lin
concrete MathEng of Math = open SyntaxEng, ParadigmsEng in f
lincat</p>
        <p>Object = CN; common noun
Property = AP; adjective phrase
Statement = S; sentence
group = mkCN (mkN "group");
abelian = mkAP (mkA "Abelian");
commutative = mkAP (mkA "commutative");
addProperty prop obj = mkCN prop obj;
hasProperty obj prop = mkS (mkCl (mkNP aSg Det obj) prop);
Listing 2: Example concrete syntax module for English. We use the resource grammar library to avoid dealing
with syntactic details.</p>
        <p>Now, we can take for example the English string \an Abelian group is commutative" and parse it with the
concrete English syntax into its semantic tree representation (Figure 2a). If we also create a concrete syntax for
the French language, we can use it to linearize this tree into a French string (Figure 2b). In other words, we
translate a statement from English to French.
3</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Using GF for Mathematical Documents</title>
      <p>Mathematical Documents consist of natural language (discourse), formulae, diagrams, tables, images etc. We
concentrate on the interplay of discourse and formulae, which so far has been little studied.</p>
      <p>When creating a grammar, it is important to nd the right balance between a semantic design and a syntactic
one. In general, we aim at a semantic design which abstracts away from any particular language and only
focuses on the meaning. While this simpli es content formalization, there are some draw-backs, which require
us to compromise and keep some syntactic elements. Consider the statement \f is injective, but not surjective".
The meaning of it is something like injective(f ) ^ :surjective(f ), which could just as well be written as \f is
injective and not surjective". During translation, it is desirable to maintain the contrast indicated by the word
\but ", which means that it has to be expressed in the parse trees.</p>
      <p>Our grammar consists of two parts: a part for parsing formulae (Section 3.2) and a part for parsing discourse
(Section 3.3). This division suggests itself, especially when it comes to the concrete syntaxes: For discourse,
we need concrete syntaxes for di erent natural languages (English, German, ...), as well as for a logical
representation. Formulae, on the other hand, are not very language dependent, but rather representation dependent
(LATEX, MathML, ...). There are, however, some formulae that have di erent representations in di erent
languages. For example, the formula \gcd(a; b)" (\gcd" is the \greatest common divisor ") translates to \ggT(a; b)"
in German (\ggT" stands for \gro ter gemeinsamer Teiler ").</p>
      <p>The discourse vocabulary, as well as identi ers and operators in formulae are stored in lexica. So far, these
have been mostly hand-written to cover a few examples. In the future, we can try to generate lexica of identi ers
and operators from corpora. For the natural language vocabulary, multilingual resources (like SMGloM [Gin+16])
would be needed.
3.1</p>
      <sec id="sec-4-1">
        <title>Representation of Mathematical Language</title>
        <p>In natural language processing, the representation of sentences is usually rather straight-forward, as there are
standardized ways to represent a sentence as a sequence of characters. In mathematics, however, this is more
challenging, as formulae need to be represented as well.</p>
        <p>Many mathematical papers are written in LATEX. However, parsing LATEX is not an easy task. As mentioned
before, our work is focused on the arXMLiv corpus [Arxa], which consists of HTML5 documents generated from
LATEX with LaTeXML [Mil]. The formulae are represented as MathML [Aus+03] and come in two di erent
avours: Content MathML and Presentation MathML. Content MathML is a more semantic representation,
which would theoretically make it the better choice for formalization. However, LaTeXML does not generate
high-quality content MathML, as this is a highly non-trivial undertaking. This leaves us with presentation
MathML. To avoid the hassle of parsing XML in GF, we decided to use a simpli ed representation based on the
presentation markup. For example, the formula \1 &lt; m &lt; n" would be represented as $ mrow( mn( 1 ) mo( &lt; )
mi( m ) mo( &lt; ) mi( n ) ) $. It is easy to convert between this representation and presentation MathML.</p>
        <p>Of course, it is always possible to create more concrete syntaxes to support other formula representations.
For example, there is a di erent concrete syntax for logical representations. The example above e.g. would be
represented as less(1; m) ^ less(m; n).
3.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>The Formula Grammar</title>
        <p>As mentioned before, formulae can take di erent grammatical roles in a sentence, and based on that we parse
them di erently. Let's demonstrate that with the formula \n &gt; 2". It could be a part of the declaration \let
n &gt; 2 be an integer ". In this case, an identi er, \n", is introduced as an integer, and it is further restricted to
be greater than 2 in the same formula it was introduced in. The semantic representation could be something
like int(n) ^ greater(n; 2). The logical representation generated with our grammar actually looks a bit di erent
(Section 3.3). Anyway, it becomes obvious that we need to identify and extract \n" in order to assert that it is
an int. A di erent example is the sentence \we conclude that n &gt; 2". In this case, the formula takes the place
of a whole sub-sentence and there is no need to extract an identi er.</p>
        <p>In these two scenarios, the formulae are represented by di erent GF categories: Identi er (with potential
restrictions) and Statement. Figure 3 shows the parse tree of \n &gt; 2" as an Identi er (as in the declaration \let
n &gt; 2 be an integer "). The root node combines the identi er \n", the relation \&gt;", and the expression \2".
Figure 4 shows the parse tree for \n &gt; 2" as a Statement (as in \we conclude that n &gt; 2"). Note that the trees
di er depending on the type of formula.</p>
        <p>Another noteworthy design decision was to represent the binary relation \&gt;" simply as a leaf node. An
alternative approach would have been to represent the relation as a function which takes two expressions as
arguments. We decided against this, so that each relation can be used in di erent ways. This is particularly
useful, because binary relations can be combined into relations of higher arity. For example, in \0 &lt; r 1", we
could consider &lt; as a ternary relation. De ning every possible combination of these higher-arity relations
would be very di cult as they can get quite long { consider e.g. the relation \0 = t0 &lt; t1 &lt; : : : &lt; tn = 1", in
which things are complicated even further with an ellipsis.
3.3</p>
      </sec>
      <sec id="sec-4-3">
        <title>The Discourse Grammar</title>
        <p>The discourse grammar deals with the natural language in documents. It basically combines natural language
rules (which we get from the resource grammar library) with mathematical vocabulary, and extends them with
a set of idioms that are used in mathematical language.
3.3.1</p>
      </sec>
      <sec id="sec-4-4">
        <title>Mathematical Objects</title>
        <p>Let's rst take a look at mathematical objects, the \building blocks" of mathematics. With this we refer to
phrases like \positive integer " or \bijection on the real numbers ". The descriptions of mathematical objects can
get quite complex, as for example in the declaration \let f be a continuous real-valued function de ned on the
interval [0,1] that is monotonically increasing on [ 41 ; 34 ]".</p>
        <p>In GF, we represent mathematical objects with the category MObj. There are atomic MObjs like integer MObj.
More complex MObjs can be composed by putting further restrictions on the object. For example, in
\positive integer " (parse tree in Figure 5), integer MObj is further restricted with positive MObjProp. The leaves
integer MObj and positive MObjProp are de ned in the mathematical lexicon.</p>
        <p>With the concrete syntax for logical representations, \positive integer " gets linearized into the -function
x:pos(x) ^ int(x). This way, the declaration \let n be a positive integer " can get linearized into ( x:pos(x) ^
int(x))n, which -reduces to pos(n) ^ int(n). Note that we interpreted the adjective \positive" intersectively,
i.e. the set of positive integers is the intersection of the set of positive objects with the set of integers. Most
mathematical adjectives are, with few exceptions: \proper " is not, but still subsective: the set of \proper
subgroup"s is a subset of the set of \subgroup"s. Examples of mathematical adjectives that are not even subsective
are \improper " and \generalized ", but also the pre x \quasi " as in \quasi-group".
One of the motivations behind this project is the formalization and translation of SMGloM [Gin+16], the
\Semantic, Multilingual Glossary for Mathematics", which contains a large number of de nitions for di erent
mathematical elds in a variety of natural languages. A very simple example de nition is \a unital quasigroup is called
a loop". The parse tree for this de nition is shown in Figure 6. It demonstrates many of the design decisions
we've made so far. The root node combines two mathematical objects (MObjs, see also Section 3.3.1), which are
the de niens (\unital quasigroup") and the de niendum (\loop"), as well as an Identi er (see Section 3.2), which
can get assigned to the de niens. In this case, however, no identi er is provided, which is denoted by the special
identi er no identi er. For the logical representation, a default identi er, x0, is introduced in this situation:
(8x0:( x:unital(x) ^ quasigroup(x))x0 ) (( x:loop(x))x0)). Applying -reduction to this expression, we get
8x0:unital(x0) ^ quasigroup(x0) ) loop(x0).</p>
        <p>Usually, de nitions are more complex and have an additional condition. An example for this is the de nition
\an integer n is called even, i 2jn". The de nition core \an integer n is called even" could grammatically be
a complete de nition on its own, but there is the additional condition \2jn". The parse tree (Figure 7) has the
root node i de nition, which takes a de nition (core) and a Statement that is the additional condition. The
generated logical representation for this de nition is (8n:(( x:int(x))n) ) (( x:even(x))n , divides(2; n)))
(or simpli ed via -reduction 8n:int(n) ) (even(n) , divides(2; n))).
4</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation and Examples</title>
      <p>The coverage of our framework is still limited to a few examples. Apart from the small examples shown above,
a few larger examples are covered as well, like the previously mentioned de nition of a prime number:
\A positive integer n is called prime, i there is no integer 1 &lt; m &lt; n such that mjn".</p>
      <p>Note that we used a shorthand notation of the formula representation for conciseness. The generated German
translation is:
\Eine positive ganze Zahl n ist prim genau dann, wenn es keine ganze Zahl 1 &lt; m &lt; n gibt, sodass mjn".</p>
      <p>Which is not the translation we suggested in the introduction, but de nitely correct. The following logical
representation is generated from the parse tree (which is too large to show here):</p>
      <p>(8n:(( x:pos(x) ^ int(x))n) ) (( x:prime(x))n , (:9m:( x:int(x))m ^ less(1; m) ^ less(m; n) ^
(divides(m; n)))))</p>
      <p>This can be -reduced to the desired representation mentioned in the introduction:
8n:pos(n) ^ int(n) ) (prime(n) , :9m:int(m) ^ divides(m; n) ^ less(1; m) ^ less(m; n))
More examples and the source code can be found at [MGF].
5</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion, Next Steps and Future Work</title>
      <p>We have presented our ongoing e orts to use GF for the translation and formalization of mathematical
documents. The syntactical framework is established and tested on a few example sentences, which were successfully
translated between English and German and formalized into a logical representation.</p>
      <p>It should be mentioned that, once English sentences could be parsed, it was surprisingly easy to create
rudimentary German concrete syntaxes for translation (it took only about 2 hours to get a rst translation of
the prime number de nition). Adding more languages could be even easier, if we better take advantage of GF's
functors, which essentially allow to have shared code for the common parts of concrete syntaxes.</p>
      <p>Some of the next steps would be:</p>
      <p>Switching to Discourse Representation Theory (DRT) [Kam81] for the logical representation. This would be
required for the representation of anaphora, but could also simplify, for example, the handling of declarations.
Extending the grammars for wider coverage. One of the more challenging extensions would be the handling
of multiple identi ers in declarations (as e.g. in \let n &lt; m be integers"). Armin Fiedler and Xiaorong
Huang's work on aggregation [FH95] might be useful for this.</p>
      <p>Using large, multilingual lexica for increased coverage.</p>
      <p>Using the framework (once it has su ciently matured) for translations in SMGloM [Gin+16] (the Semantic,
Multilingual Glossary for Mathematics).</p>
      <p>The grammar, along with a few examples, can be found at [MGF].
[Arc+12]</p>
      <p>Dominique Archambault et al. \Using GF in multimodal assistants for mathematics". In: Digitization
and E-Inclusion in Mathematics and Science. 2012, pp. 1{12. url: https://www.molto-project.
eu/sites/default/files/ArchambaultCaprottiRantaSaludes_0.pdf.
arXMLiv. seen July 2018. url: https://kwarc.info/projects/arXMLiv/.</p>
      <p>arxiv.org e-Print archive. url: http://www.arxiv.org.
[Aus+03] Ron Ausbrooks et al. Mathematical Markup Language (MathML) Version 2.0 (second edition). W3C
Recommendation. World Wide Web Consortium (W3C), 2003. url: http : / / www . w3 . org / TR /
MathML2.</p>
      <p>Judith Baur. \Syntax und Semantik mathematischer Texte { ein Prototyp". MA thesis. Saarbrucken,
Germany: Fachrichtung Computerlinguistik, Universitat des Saarlandes, 1999.</p>
      <p>Olga Caprotti. WebALT! Deliver Mathematics Everywhere.</p>
      <p>GF - Grammatical Framework. url: http : / / www . grammaticalframework . org (visited on
09/27/2017).</p>
      <p>Deyan Ginev et al. \The SMGloM Project and System. Towards a Terminology and Ontology for
Mathematics". In: Mathematical Software - ICMS 2016 - 5th International Congress. Ed. by
GertMartin Greuel et al. Vol. 9725. LNCS. Springer, 2016. doi: 10.1007/978- 3- 319- 42432- 3. url:
http://kwarc.info/kohlhase/papers/icms16-smglom.pdf.</p>
      <p>Deyan Ginev. \The Structure of Mathematical Expressions". Master's Thesis. Bremen, Germany:
Jacobs University Bremen, Aug. 2011. url: http://kwarc.info/people/dginev/publications/
DeyanGinev_Mastersthesis.pdf.</p>
      <p>Helmut Horacek. \Presenting Proofs in a Human-Oriented Way". In: Proceedings of the 16th
Conference on Automated Deduction. Ed. by Harald Ganzinger. LNAI 1632. Springer Verlag, 1999, pp. 142{
156.</p>
      <p>Xiaorong Huang. \Proof Transformation Towards Human Reasoning Style". In: Proceedings of the of
13th GWAI. Ed. by Dieter Metzing. Informatik-Fachberichte 216. Springer Verlag, 1989, pp. 37{42.
Hans Kamp. \A Theory of Truth and Semantic Representation". In: Formal Methods in the Study
of Language. Ed. by J. Groenendijk, Th. Janssen, and M. Stokhof. Amsterdam, Netherlands:
Mathematisch Centrum Tracts, 1981, pp. 277{322.</p>
      <p>The Mathematical Grammatical Framework. url: https://gl.kwarc.info/smglom/GF (visited on
07/14/2018).</p>
      <p>Bruce Miller. LaTeXML: A LATEX to XML Converter. url: http://dlmf.nist.gov/LaTeXML/.
Aarne Ranta. \Grammatical Framework | A Type-Theoretical Grammar Formalism". In: Journal
of Functional Programming 14.2 (2004), pp. 145{189.</p>
      <p>Aarne Ranta. Grammatical Framework: Programming with Multilingual Grammars. ISBN-10:
157586-626-9 (Paper), 1-57586-627-7 (Cloth). Stanford: CSLI Publications, 2011.</p>
      <p>Aarne Ranta. Grammatical Framework - Formalizing the Grammars of the World. Sept. 7, 2016.
url: http://www.grammaticalframework.org/~aarne/gf-google-2016.pdf.</p>
      <p>Jordi Saludes and Sebastian Xambo. \The GF Mathematics Library". In: THedu. Ed. by Pedro
Quaresma and Ralph-Johan Back. Vol. 79. EPTCS. 2011, pp. 102{110. doi: 10.4204/EPTCS.79.6.
Konstantin Verchinine, Er Lyaletski, and Andrei Paskevich. \System for Automated Deduction
(SAD): A Tool for Proof Veri cation". In: Proceedings of the 21st International Conference on
Automated Deduction, number 4603 in Lecture Notes in Arti cial Intelligence. SpringerVerlag, 2007,
pp. 398{403.
[Wol13]</p>
      <p>Claus Zinn. \Understanding Informal Mathematical Discourse". PhD thesis. Technischen Fakultat
der Universitat Erlangen-Nurnberg, 2004. url: https : / / sites . google . com / site / clauszinn /
verifying-informal-proofs/37_04.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bus+04]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Buswell</surname>
          </string-name>
          et al.
          <source>The Open Math Standard, Version 2.0. Tech. rep. The OpenMath Society</source>
          ,
          <year>2004</year>
          . url: http://www.openmath.org/standard/om20.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Olga</given-names>
            <surname>Caprotti</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Saludes</surname>
          </string-name>
          . \
          <article-title>The Gf Mathematical Grammar Library"</article-title>
          .
          <source>In: 24th OpenMath Workshop, 7th Workshop on Mathematical User Interfaces (MathUI), and Intelligent Computer Mathematics Work in Progress. (Bremen, Germany, July</source>
          <volume>9</volume>
          {
          <fpage>13</fpage>
          ,
          <year>2012</year>
          ). Ed. by James Davenport et al.
          <source>CEUR Workshop Proceedings 921. Aachen</source>
          ,
          <year>2012</year>
          , pp.
          <volume>49</volume>
          {
          <fpage>52</fpage>
          . url: http://ceur-ws.org/Vol921/openmath-02.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Armin</given-names>
            <surname>Fiedler</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xiaorong</given-names>
            <surname>Huang</surname>
          </string-name>
          . \
          <article-title>Aggregation in the Generation of Argumentative Texts"</article-title>
          .
          <source>In: Proceedings of the 5th European Workshop on Natural Language Generation</source>
          . Leiden, Netherlands: Rijks University Leiden,
          <year>1995</year>
          , pp.
          <volume>5</volume>
          {
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [WKK04]
          <article-title>Magdalena Wolska and Ivana Kruij -Korbayova. \Analysis of Mixed Natural and Symbolic Input in Mathematical Dialogs"</article-title>
          .
          <source>In: ACL</source>
          .
          <year>2004</year>
          , pp.
          <volume>25</volume>
          {
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[Zin04]</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>