<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Notation-based Semantification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ion Toloaca</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kohlhase</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>The scientific community produces a large number of mathematical papers (approximately 108:000 new papers per year [arX]), which raises the importance of machine based processing of such documents. Unfortunately, the most popular formats in which these papers are found (for instance, LATEX) do not contain much information that would allow the computers to infer the human-understandable knowledge contained within a paper. Since, at this point, changing these formats is not practically possible, the other solution is to add a semantic flavor to the existing documents by translating them into a more suitable format, for instance, Content MathML. The current scientific community publication went through a series of evolutions in search of the best method of writing scientific documents. This process was highly influenced by the invention and spreading of the internet. Scientists understood the necessity of a standard that could help them write and exchange their findings in an efficient way. A lot of effort has gone into translating books into digital documents. The next step in this evolution is translating digital documents into knowledge-rich digital documents. This next step can only happen if a new feasible way of transition appears, which MathSemantifier attempts to become. Ambiguity is one of the main reasons that makes semantification complex. Mathematical documents are not a simple collection of symbols. The main use of these documents emerges only when the intended semantics of a document is accessible. However, humans tend to be lazy in writing down the whole graph, but instead rely on implicit human knowledge to decipher these documents. This is where ambiguity comes into play, when the author relies on the ability of the human to use the context of document in order to pinpoint the actual meaning an expression. Ambiguities can be largely divided into two: structural and idiomatic ambiguities. A simple example that demonstrates the concept of structural ambiguities can be “ sin x = 2”. It can mean one of the following: Copyright c by the paper's authors. Copying permitted for private and academic purposes. In: Alexandre Rademaker and Vinay K. Chaudhri (eds.): Proceedings of the 6th Workshop on Formal Ontologies meet Industry, Rio de Janeiro, Brazil, 22-SEP-2014, published at http://ceur-ws.org</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>1. sin applied to x2</title>
      </sec>
      <sec id="sec-1-2">
        <title>2. 12 times sin applied to x</title>
        <p>Contrary to structural ambiguities, idiomatic ambiguities are not due to different parse trees.
Given one single parse tree, some formulae allow for multiple readings. A standard example
would be Bn. This could be:</p>
        <sec id="sec-1-2-1">
          <title>1. The sequence of Bernoulli numbers</title>
        </sec>
        <sec id="sec-1-2-2">
          <title>2. A user defined sequence</title>
        </sec>
        <sec id="sec-1-2-3">
          <title>3. The vertex of one of a series of geometric objects</title>
          <p>Consider c(a(b)). The most natural interpretation is c of a of b. However, there arise
multiple meanings if we consider that the application could be interpreted as multiplication. The
human reader discards such meanings by convention and experience of handling
mathematical documents, however, teaching this to a computer is a complex task. The approach that
MathSemantifier takes is simply extracting all the possible meanings, while trying to apply
heuristics to weed out impossible ones.</p>
          <p>Possible Applications of Semantification include:</p>
          <p>MathWebSearch [HPK14] is a search engine for mathematical formulas. Such search
engines could greatly benefit from semantification. The idea is that a search engine is only
as good as the database of information is. Semantification allows a new kind of queries
could be possible: semantic queries. Rather than searching for strings, or formulas with
free form subterms, the user could specify the meaning of the sought expression. This would
improve a lot the relevance of the results, since there will be no result that matched just
because it was presented in a similar manner.</p>
          <p>Another possibility for using semantification is theorem proving and correctness checking.
One possible application would be assisting authors in writing semantic content by providing
real-time feedback.</p>
          <p>The possibilities extend even beyond this. Using semantified content, rather than having
a database of CMML expressions, it is possible to create a smart knowledge management
system that could be used to create expert systems. The user could then ask questions, or
create complex queries, to exploit the full power of semantic content.</p>
          <p>Semantification will not make all of the above directly possible, however it is a necessary step
towards achieving goals similar to the ones described above, that require more knowledge about
the used content than just how it is rendered.
1.1</p>
          <p>An Introduction to MathSemantifier
In a nutshell the MathSemantifier algorithm converts Presentation MathML (PMML) to
Content MathML (CMML [ABC+10]) it is the converse of the presentation algorithm that converts
CMML to PMML [KMR08] using a database of notation renderings. In Figure 1.1 a typical
example of what the presentation algorithm produces is displayed. The first child of the semantics
node contains the PMML that corresponds to the CMML contained in the annotation-xml
node. In order to produce this output, the algorithm used the notation natarith addition
(shown in Figure 1.1 as well). OMDoc Notations like natarith addition are initially written
in sTEX [Koh08], and then converted to OMDoc using LATEXML [Mil].
MathSemantifier converts PMML into valid CMML as described above. In order to perform
this task, it needs to match PMML against a list of notations. This is achieved by compiling the
notations into a context free grammar, and using a CFG parsing engine to parse the PMML. A
parse returns a list of possible parse trees, out of which MathSemantifier extracts information
regarding what notations matched at top level and continues with the arguments of the found
notation recursively. We use the Marpa Grammar Engine (see [Keg]).</p>
          <p>We generate a database of notations, we use the SMGloM corpus [GIJ+, Koh]., a mathematical
thesaurus, which contains over 1000 notation definitions for elementary mathematical concepts.
This sTEX-encoded corpus is converted to OMDoc using LATEXML. MathSemantifier is an
extension of the MMT system [Rab] a scalable mathematical knowledge processing engine that
reads the OMDoc notation definition and generates a context free grammar from them.</p>
          <p>MathSemantifier takes this generated CFG and input from the Web UI in order to produce
possible parse trees of the input. The Marpa grammar engine is the proposed tool for this
purpose. The main advantages are that Marpa handles ambiguous grammars and provides
control over the parsing process. The author of Marpa provides a more detailed analysis of the
advantages of Marpa.
2</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>The MathSemantifier System</title>
      <p>The major idea of MathSemantifier is, as already described in the introduction, finding
possible Content MathML readings for Presentation MathML input expressions.</p>
      <p>The general flow of a single semantification can be described as follows:
1. Generation of Context Free Grammar from MMT Notations
2. Parsing using the Marpa Grammar Engine and the generated CFG to detect the top level
notation
3. Parsing the arguments of the top level notation recursively
4. Using the parse trees from step 2 and 3 to generate an internal representation of the meaning
trees</p>
      <sec id="sec-2-1">
        <title>5. Converting the meaning trees to Content MathML</title>
      </sec>
      <sec id="sec-2-2">
        <title>6. Displaying the Content MathML trees in the frontend</title>
        <p>A CFG-based solution was chosen because of the preexisting parsing frameworks like the Marpa
Grammar Engine. Such frameworks solve all the parsing-related technical problems, like parsing
ambiguous expressions, different kinds of recursion, while also providing a high degree of freedom.</p>
        <p>The main goals of the system can be expressed in a concise manner as follows:
1. Generating the correct set of parses efficiently and effectively
2. Providing opportunities for improvement for further research in the area
The architecture can be roughly divided into four parts as shown in Figure 2.1.
2.1</p>
        <p>The Web User Interface
The Web UI is a core component of MathSemantifier. It is intended to be a lightweight
solution that queries a server for the results of more computationally-intensive tasks.</p>
        <p>The interface consists of an input area, where MathML needs to be inputted, and three
options:
1. Semantify (The user can guide the semantification of the top symbol directly, by choosing
the correct matching range, notation and argument positions)
2. Show Semantic Tree (The other option is to ask for all the possibilities and get all the
semantic trees)
3. Evaluation (The user is walked through a series of examples to demo the functionality)
The repository containing the Web UI can be found on GitHub [Tolb].</p>
        <p>User Guided Semantification is performed as follows: The user provides Presentation MathML
as input to the system, then uses the Semantify button to reveal a list of top level notation
names. The names of the notations are derived from the notation paths as follows: archive
name + symbol name, for example, natarith addition refers to the addition of natural
numbers. After determining which notation is the correct one, the user needs to make sure
that the arguments were detected properly and, finally, the resulting Content MathML tree is
displayed (see Figure 2.2 for a similar representation of the result).</p>
        <p>The easier but computationally more expensive alternative is to simply generate all the
possible parse trees at once and display them. The user simply needs to click the Show Semantic
Tree option.</p>
        <p>For the example shown in
Figure 2.2, there is a total of 342
different readings. This can be
easily explained, since Invisible
Times, Arithmetic Plus and
Mod have each multiple
notations definitions (that originate
from different MMT archives, for
instance), and there are no
limitations on what kind of notation
definitions can go together in the same
CMML tree.</p>
        <p>In order to minimize the Content
MathML, the standard allows
subtree sharing. To enable this option,
the Use term sharing checkbox
should be checked. In that case,
the terms is shared among
different readings.</p>
      </sec>
      <sec id="sec-2-3">
        <title>1. Compile the MMT Notations into a CFG Grammar</title>
        <p>2. Receive the input from the Web UI and build its Semantic Tree</p>
      </sec>
      <sec id="sec-2-4">
        <title>3. Delegate the parsing to the Parsing Backends</title>
        <p>We decided to put the core logic of the application in MMT in order to make it easier to
interoperate with the MMT Notation Database, as well as with any other MMT components
that may need MathSemantifier.</p>
        <p>The code can be found as part of the MMT codebase [KT].</p>
        <p>Let us look at the components of the MMT Backends in more detail below.
2.3</p>
        <p>The Grammar Generator
The Grammar Generator aggregates all the knowledge contained in the MMT Notations into
one Context Free Grammar. The grammar is shaped into the normal form accepted by the
2.2</p>
        <p>The MMT Backend</p>
      </sec>
      <sec id="sec-2-5">
        <title>The MMT Backend is a Server</title>
        <p>Extension that is part of MMT.</p>
        <p>As shown in Figure 2.1, it is linked
via REST with the Web UI and
the Parsing Backends.</p>
        <p>Its role can be summarized to
the following core functions:
Marpa Grammar Engine. To achieve this, the format used to store the notations in MMT is
decomposed into CFG rules. Otherwise said, the tree-like structure of each formula, that is
stored as nested applications of MMT Markers needs to be serialized into CFG rules.</p>
        <p>This is done in several steps:</p>
      </sec>
      <sec id="sec-2-6">
        <title>1. Break apart the MMT Marker trees</title>
        <p>into level by level representations</p>
      </sec>
      <sec id="sec-2-7">
        <title>2. Transform the intermediate representation into valid CFG rules</title>
      </sec>
      <sec id="sec-2-8">
        <title>The fundamental structure of the CFG is established using a preamble as shown in Figure 2.3. Note the default action is the</title>
        <p>Grammar Entry Point. It is precisely
what tells the Grammar Engine to build
parse trees, and also determines their
structure.</p>
        <p>The entry point into the grammar is the
Expression rule. It can be an MMT
Notation, or Presentation MathML. The prec0
rule should be read as precedence zero, that
is - the lowest precedence there is.</p>
        <p>Precedence handling is done using a
commonly-used method for including it in
CFGs. The Notation Precedences
secthioown eixnacFtilgyuirteis2d.3ongei.veTshea nquumicbkergolafnpcreecaet- Figure 2.3: CFG Preamble
dences needs to be known in advance, then,
for each precedence value a corresponding
precN rule is created.</p>
        <p>The rule contains all the notations with
that precedence, and, one of the alternatives is going to a higher precedence value. In the
example below there are 15 used precedence values (prec0 corresponds to precedence 1).
However, the Notation Precedences section shows only half of the concept.</p>
        <p>To make this approach work,
what is also needed is that the
arguments in a rule of a certain
precedence N can only contain
notations with precedence K if K &gt; Figure 2.4: MMT Notation in CFG
N . This is done by the Argument
Precedences part in the preamble (as shown in Figure 2.3).</p>
        <p>Finally, Figure 2.4 presents an example of what the natharith addition MMT notation from
the smglom/numberfields archive translated to CFG rules looks like.
The Semantic Tree Generator works by recursively querying the Parsing Backend and
using the result to construct the tree of possible meanings.</p>
        <p>The parse trees stored in MMT have the following structure:</p>
        <p>Variants - represents a list of possible readings. It is always the top node in any parse
tree.</p>
        <p>Notation - a notation detected in the input. This node contains the name and arguments
of the notation it represents.</p>
        <p>Argument - an argument of a notation. The plugin is recursively called on it to construct
its meaning subtree as well.</p>
      </sec>
      <sec id="sec-2-9">
        <title>RawString - the ground term representation.</title>
        <p>The final part of the semantification processes is converting the internal representation to a
standard one, which is CMML in this case. MMT provides a simple API which requires:</p>
      </sec>
      <sec id="sec-2-10">
        <title>1. MMT notation path</title>
        <p>2. Argument maps (maps from the argument number to the corresponding substring)
Both of which are available in the representation structure. The argument path is obtained
by extracting the argument number from the argument name, and looking up in a map of
paths created at the time of grammar creation. This implies the grammar rule names are
overloaded with meaning, however, the possibilities are very limited in this aspect since the
parsing framework used does not give complete control over the parsing process.</p>
        <p>The Parsing Backend Figure 2.1 receives requests from the Web UI directly for Guided
Semantification and from the MMT Backend for Semantic Tree Generation. All the parsing
related work is delegated to this backend.</p>
        <p>The code of the Parsing Backend can be found on GitHub [Tola].</p>
        <p>The parsing backend consists of two parts.</p>
        <p>First of all, the CFG needs to be queried and parsed. This is implemented using lazy
evaluation, which means that it is only done when a request actually comes.</p>
        <p>The serialized CFG is unpacked and fed to the Marpa Parser Generator.</p>
        <p>Second, useful information needs to be extracted from the generated parse trees. Note that
going through all the parse trees is not practical, so only the first N (currently 1000) parse
trees are processed. This still gives the correct results in most cases since the grammar rules are
optimized for giving preference to parse trees that are more likely to be correct.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>This section presents an evaluation of MathSemantifier from the point of view of efficiency
and effectiveness of semantic tree generation. Next, the interoperability of the system with other
possible applications is examined, which mostly depends on the backend APIs.</p>
      <p>Testing MathSemantifier through normal operation is not a trivial task, because the results
need a human expert to check whether the results are indeed correct. Fortunately, the
MathHub Glossary [KWA] contains a sufficient number (about 3000) of examples of Presentation
MathML with Content MathML annotations that result from applying the Presentation
Algorithm discussed in the introduction. Since MathSemantifier is the partial inverse of the
Presentation Algorithm, checking whether the results are indeed correct boils down to
comparing two Content MathML trees.</p>
      <p>For the purpose of testing MathSemantifier on the MathHub Glossary, the Evaluation
option mentioned in the Web UI section is used. It processes examples from the MathHub
Glossary.</p>
      <p>A visual control of approximately 500 examples from the MathHub Glossary revealed the
following results: About 40% of the examples are semantified correctly. However, this is largely
because of reasons that do not depend on MathSemantifier itself.</p>
      <p>Contrary to what the percentage number suggests, for expressions with less than 1000 parse
trees that are correctly rendered, MathSemantifier produces correct results with a very high
probability.</p>
      <p>Let us look into the reasons that make MathSemantifier fail. First of all, the reasons that
do not depend on the system itself.</p>
      <p>1. The Presentation Algorithm fails to render CMML into PMML properly in about 30-40%
of the cases.
2. About 10% of the examples include symbols from archives that were not included in the</p>
      <p>CFG
Up to this point, if we exclude the above mentioned examples, the effective success rate of
MathSemantifier is about 80-90%.</p>
      <p>1. Some notations are omitted because they create cycles in the CFG
2. Input data that is too long or complex having more than 1000 parse trees
3. Assuming the input is more knowledge-rich than it actually is
While the effective success rate of 80-90% is inspiring for a proof of concept system, the main
issue of MathSemantifier is simply poor performance on long or complex input. Therefore,
MathSemantifier is a successful proof of concept, but not yet a practical tool.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The conclusion of this study is the proof of concept architecture and implementation of a system
capable of converting Presentation MathML to all the possible meanings, which is a list of
Content MathML. The testing revealed that the system is able to recognize correctly single
top level symbols, as well as the whole set of readings of expressions with less than 1000 parse
trees (this is not the limit, but no testing is done beyond that). Naturally, MathSemantifier
can only recognize the symbols that were used when building the Context Free Grammar it uses.
This shows that it is certainly possible to aggregate the knowledge from the MMT Notations
and to use it for parsing purposes. However, both the Notations and the Parsing Framework
need significant improvements in order for the system to be scalable beyond what is presented
in the previous section. The most important part of this study is, therefore, the optimizations
and heuristics used, and other techniques presented below that further research could benefit
from.
[Rab] Florian Rabe. The MMT API: A Generic MKM System.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [ABC+10]
          <string-name>
            <surname>Ron</surname>
            <given-names>Ausbrooks</given-names>
          </string-name>
          , Stephen Buswell, David Carlisle,
          <string-name>
            <given-names>Giorgi</given-names>
            <surname>Chavchanidze</surname>
          </string-name>
          , Stéphane Dalmas, Stan Devitt, Angel Diaz, Sam Dooley, Roger Hunter, Patrick Ion, Michael Kohlhase, Azzeddine Lazrek, Paul Libbrecht, Bruce Miller, Robert Miner, Murray Sargent,
          <string-name>
            <given-names>Bruce</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Neil</given-names>
            <surname>Soiffer</surname>
          </string-name>
          , Robert Sutor, and Stephen Watt.
          <source>Mathematical Markup Language (MathML) version 3.0. W3C Recommendation, World Wide Web Consortium (W3C)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[arX] arXiv. Monthly Submissions Statistics. submissions. [Online; accessed 8-May-2016].</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [HPK14]
          <string-name>
            <given-names>Radu</given-names>
            <surname>Hambasan</surname>
          </string-name>
          , Corneliu C.
          <article-title>Prodescu, and Michael Kohlhase</article-title>
          .
          <source>MathWebSearch at NTCIR-11</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Keg]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Kegler</surname>
          </string-name>
          .
          <article-title>The Marpa Parser</article-title>
          . https://jeffreykegler.github.io/ Marpa-web-site/. [Online; accessed 19-April-2016].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [KMR08]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          , Christine Müller, and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Rabe</surname>
          </string-name>
          .
          <article-title>Notations for living mathematical documents</article-title>
          .
          <source>In Serge Autexier</source>
          , John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, and Freek Wiedijk, editors,
          <source>Intelligent Computer Mathematics, number 5144 in LNAI</source>
          , pages
          <fpage>504</fpage>
          -
          <lpage>519</lpage>
          . Springer Verlag,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Koh]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          .
          <source>The SMGLOM Archive</source>
          . https://mathhub.info/smglom. [Online; accessed 4-May-2016].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Koh08]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          .
          <article-title>Using LATEX as a semantic markup format</article-title>
          .
          <source>Mathematics in Computer Science</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>304</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [KT]
          <article-title>KWARC and Ion Toloaca. MathSemantifier Parsing Backend</article-title>
          . https://svn.kwarc. info/repos/MMT/. [Online; accessed 4-May-2016].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[KWA] KWARC. MathHub Glossary</source>
          . https://mathhub.info/mh/glossary. [Online; accessed 1-May-2016].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Mil]
          <string-name>
            <given-names>Bruce</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>LaTeXML: A LATEX to XML converter</article-title>
          . Web Manual at http://dlmf. nist.gov/LaTeXML/.
          <source>seen September</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>