<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Normalization of Digital Mathematics Library Content</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>MathML Canonicalization</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Masaryk University, Faculty of Informatics, Botanická 68a</institution>
          ,
          <addr-line>602 00 Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper discusses the need for data normalization in a Digital Mathematics Library (DML). Specifically, emphasis is given to canonicalizing formulae encoded in Presentation MathML notation, which is becoming available in several DMLs and is used by DML applications. Canonicalization is a prerequisite for advanced processing, namely math-enabled fulltext searching, semantic filtering, and automated classification. Different sources of MathML and their specifics are described. Several use cases of possible formulae canonicalization transformations are listed and discussed in detail. Finally, the findings are summarized and the design of a canonicalization tool to be developed is outlined.</p>
      </abstract>
      <kwd-group>
        <kwd>MathML normalization</kwd>
        <kwd>canonicalization</kwd>
        <kwd>digital mathematics libraries</kwd>
        <kwd>DML</kwd>
        <kwd>presentation MathML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Modern Digital Mathematics Libraries (DML) such as EuDML [
        <xref ref-type="bibr" rid="ref18 ref5">18,5</xref>
        ] base their
services on paper semantics, i.e. fulltext handling, including mathematical
formulae, as well as basic metadata and Mathematics Subject Classification (MSC)
codes. Mathematics literature is widely dispersed across a large number of
publishers, making it very difficult to collect fulltexts from these heterogeneous
sources. This situation is very different from that of other libraries, such as PubMed
Central for biomedical and life sciences, where publishers have an agreed
workflow using the NLM Journal Publishing Tag Set and tools developed with funding
from the National Institutes of Health.
      </p>
      <p>
        Full paper texts have to be ‘homogenized’, converted to some uniform
representation, in order for math-aware full-text searches [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and paper similarity
computations [
        <xref ref-type="bibr" rid="ref11 ref12">11,12</xref>
        ] to work properly. These tasks are usually based
on a bag-of-words representation of the document text (the vector space model), in which
every term (word, lemma) has its own dimension and the value in that dimension
reflects the number of occurrences of the term. Non-textual terms such as mathematical formulae
are mostly not taken into account. This creates another challenge for DMLs, as
mathematical formulae are the essence of mathematical publications. There is an
average of 380 mathematical formulae per arXiv paper in the MREC database [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
It has been reported [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] that even a single histogram of mathematical symbols
is sufficient for domain classification of a paper in the mathematical domain.
      </p>
      <p>To reliably represent a paper for DML processing, including handling the
mathematics, it is necessary to
1. select a canonical representation of the non-textual structural entities
appearing in fulltexts (mathematical symbols, formulae, and equations); and
2. decide on equivalence classes for these entities (i.e., which formulae
should be considered equal for given DML tasks such as search, similarity
computation, formulae editing, and conversion of math into Braille).
In this paper, we discuss the options for selecting the canonical representations
of formulae to be used in DML tools, and the canonicalization process, i.e. the
process of computing this canonical representation from a variety of different
sources and formats.</p>
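      <p>The relation between the two steps above can be illustrated with a minimal Python sketch: once a canonical key is fixed, the equivalence classes are simply the groups of formulae that share a key. The helper names el, canonical_key and equivalence_classes are hypothetical, tag names are used without the MathML namespace for brevity, and ignoring all attributes is an assumption made only for this illustration, not the rule set proposed later in the paper.</p>
      <p>
```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def el(tag, *children, text=None, **attrib):
    """Build an element; a hypothetical helper used only in this sketch."""
    node = ET.Element(tag, attrib)
    node.text = text
    node.extend(children)
    return node


def canonical_key(node):
    """Turn a tree into a nested tuple, ignoring attributes and whitespace."""
    text = (node.text or "").strip()
    return (node.tag, text, tuple(canonical_key(child) for child in node))


def equivalence_classes(formulae):
    """Group formula names by the canonical key of their trees."""
    classes = defaultdict(list)
    for name, tree in formulae.items():
        classes[canonical_key(tree)].append(name)
    return list(classes.values())


# x^2 written by hand, x^2 exported with xref attributes, and y^2.
hand_made = el("msup", el("mi", text="x"), el("mn", text="2"))
exported = el(
    "msup",
    el("mi", text=" x ", xref="No1"),
    el("mn", text="2", xref="No2"),
    xref="No3",
)
other = el("msup", el("mi", text="y"), el("mn", text="2"))

classes = equivalence_classes(
    {"hand_made": hand_made, "exported": exported, "other": other}
)
print(classes)
```
      </p>
      <p>In this sketch the hand-made and the exported variants of x^2 fall into one class, while y^2 forms a class of its own. A different DML task could use a coarser or finer key without changing the grouping code.</p>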
      <p>
        Our primary motivation is the natural requirement for our own (Web)MIaS
system, which currently uses Presentation MathML [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], to operate correctly and
offer the expected search behaviour to users regardless of the MathML input
source. When a user posts a query, the system must abstract away
the underlying notational differences in order to behave correctly. This
requirement becomes more pressing with the growing number of different
sources of MathML. Currently there are three sources (LaTeXML, Tralics, and user
input), and the number is expected to increase. If they are not correctly normalized,
the system misbehaves and appears to users as if it simply does not work,
however good the underlying design is.
      </p>
      <p>
        We have used the UMCL library [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ] for canonicalization in our MIaS system
so far. However, we have found that the deficiencies of the software are so severe
(change of formulae semantics, slowness, ...) [7, chapter 5], and the need for
canonicalization so important, that we have decided to design and implement
a new canonicalization tool from scratch.
      </p>
      <p>This paper is structured as follows: in Section 2, different sources of
mathematics are described and their differences are discussed. The core part of this
paper is Section 3, where several use cases of possible canonical representation
and canonicalization are documented and suggested. We conclude with Section 5
and present a plan for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 MathML Sources</title>
      <p>To store mathematical formulae in our documents we have chosen MathML
(more precisely, Presentation MathML, as there are currently significantly more real-life
resources using this form of MathML than Content MathML), an XML-based language
that is a widely used, formally defined, but still evolving
standard. Owing to the widespread use of MathML and its XML basis, the
language is supported by various tools in the whole document workflow. More
importantly, MathML can be used as a common language among the advanced
computer mathematical software packages that are extensively used by working
mathematicians.</p>
      <p>On the author end of the document workflow, MathML code can be
‘hand made’ using simple plain text editors such as MS Windows Notepad, or
something more comfortable, such as the specialized XML editors that are usually
part of various integrated development environments. For example, the formula
x^2 + y^2 can be written as follows:
&lt;math xmlns='http://www.w3.org/1998/Math/MathML'&gt;
  &lt;msup&gt;
    &lt;mi&gt;x&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;
  &lt;/msup&gt;
  &lt;mo&gt;+&lt;/mo&gt;
  &lt;msup&gt;
    &lt;mi&gt;y&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;
  &lt;/msup&gt;
&lt;/math&gt;
Listing 1: Example of the ‘hand made’ formula x^2 + y^2</p>
      <p>
        However, the XML nature of MathML makes the code of more complex
formulae too long for comfortable manual construction. Software tools are therefore
a more frequent source of MathML. MathML can be generated as an output or data
exchange format of complex specialized programs, such as Maple, Matlab, and
Mathematica [
        <xref ref-type="bibr" rid="ref20 ref22 ref9">9,20,22</xref>
        ], or web services, such as the well known Wolfram
Alpha [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], that are extensively used by mathematicians to support their work.
      </p>
      <p>generate::MathML(x^2 + y^2,
                 Content = FALSE, Annotation = FALSE)
&lt;math xmlns='http://www.w3.org/1998/Math/MathML'&gt;
  &lt;mrow xref='No7'&gt;
    &lt;msup xref='No3'&gt;
      &lt;mi xref='No1'&gt;x&lt;/mi&gt;
      &lt;mn xref='No2'&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
    &lt;mo&gt;+&lt;/mo&gt;
    &lt;msup xref='No6'&gt;
      &lt;mi xref='No4'&gt;y&lt;/mi&gt;
      &lt;mn xref='No5'&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
  &lt;/mrow&gt;
&lt;/math&gt;
Listing 2: Example of the MathML export of the formula x^2 + y^2 by the Matlab 7.9.0
MuPAD symbolic engine</p>
      <p>&lt;math xmlns='http://www.w3.org/1998/Math/MathML'&gt;
  &lt;mrow&gt;
    &lt;msup&gt;
      &lt;mi&gt;x&lt;/mi&gt;
      &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
    &lt;mo&gt;+&lt;/mo&gt;
    &lt;msup&gt;
      &lt;mi&gt;y&lt;/mi&gt;
      &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
  &lt;/mrow&gt;
&lt;/math&gt;
Listing 3: Example of the MathML export of the Wolfram Alpha input query
‘x^2 + y^2’</p>
      <p>On the consumer end of the document workflow MathML can be used as an
input for mathematical programs and services (Maple, Matlab, Mathematica,
Wolfram Alpha, etc.) or simply displayed — usually as part of an XHTML web
page — in a web browser with MathML support.</p>
      <p>However, a large number of mathematical documents are produced using
the TeX typesetting system and authored in TeX markup. Thus, it is necessary to
be able to convert the TeX source code of mathematical formulae to the MathML
language. Our main motivation is the WebMIaS system. For more complex input
formulae, it would be uncomfortable for the user to manually construct queries
in MathML, as the code would be very complicated. The well known LaTeX syntax
is far more appropriate for manual input. Therefore, we need a conversion from
LaTeX to MathML as part of the WebMIaS input routine.</p>
      <p>
        There are several tools that are able to convert TeX markup to the MathML
language. For example, arXMLiv [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] employs LaTeXML [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The EuDML project
and our WebMIaS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] system internally use Tralics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>&lt;math xmlns="http://www.w3.org/1998/Math/MathML"
      alttext="x^{2}+y^{2}" display="inline"&gt;
  &lt;semantics&gt;
    &lt;mrow&gt;
      &lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;
      &lt;mo&gt;+&lt;/mo&gt;
      &lt;msup&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;
    &lt;/mrow&gt;
    &lt;annotation encoding="application/x-tex"&gt;
      x^{2}+y^{2}
    &lt;/annotation&gt;
  &lt;/semantics&gt;
&lt;/math&gt;
Listing 4: Example of LaTeXML generated MathML of the formula x^2 + y^2</p>
      <p>&lt;math xmlns='http://www.w3.org/1998/Math/MathML'&gt;
  &lt;mrow&gt;
    &lt;msup&gt;
      &lt;mi&gt;x&lt;/mi&gt; &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
    &lt;mo&gt;+&lt;/mo&gt;
    &lt;msup&gt;
      &lt;mi&gt;y&lt;/mi&gt; &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
  &lt;/mrow&gt;
&lt;/math&gt;
Listing 5: Example of Tralics generated MathML of the formula x^2 + y^2</p>
      <p>Older papers are a frequent type of mathematical document in a DML. They
are either unavailable in any digital format or available only in an ‘end’ format
such as PDF, which is suitable for reading and printing but is not appropriate for
direct MathML processing. These documents can be a significant part of the
DML content collection, so they are worth further processing.</p>
      <p>
        Documents available in hard copy only can be scanned and processed using
InftyReader [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] optical character recognition (OCR) software. InftyReader has
a unique feature for detecting mathematical formulae in a scanned document.
These formulae can be subsequently saved as MathML.
      </p>
      <p>&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;
  &lt;msup&gt;
    &lt;mi mathvariant="italic"&gt;x&lt;/mi&gt;
    &lt;mrow&gt;
      &lt;mn mathvariant="normal"&gt;2&lt;/mn&gt;
    &lt;/mrow&gt;
  &lt;/msup&gt;
  &lt;mo mathvariant="normal"&gt;+&lt;/mo&gt;
  &lt;msup&gt;
    &lt;mi mathvariant="italic"&gt;y&lt;/mi&gt;
    &lt;mrow&gt;
      &lt;mn mathvariant="normal"&gt;2&lt;/mn&gt;
    &lt;/mrow&gt;
  &lt;/msup&gt;
&lt;/math&gt;
Listing 6: Example of InftyReader generated MathML from a PDF document
containing only the formula x^2 + y^2 in its body</p>
      <p>
        Born-digital PDF documents with no available source codes can be processed
using the MaxTract software [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ], which is under intensive development as
part of the EuDML project. MaxTract generates a LaTeX source or an XHTML+MathML
representation of the document based on an optical analysis of the positions of
characters on the page. The analysis is supported with information from the
fonts embedded in the processed document.
&lt;math display="block" xmlns="&amp;mathml;"&gt;
&lt;mi&gt;&amp;#x0078;&lt;/mi&gt;
&lt;/math&gt;
&lt;p &gt;
&lt;/p&gt;
&lt;p &gt;
&lt;/p&gt;
&lt;p align="right" &gt;
&lt;math display="inline" xmlns="&amp;mathml;"&gt;
&lt;mi&gt;&amp;#x0079;&lt;/mi&gt;
&lt;/math&gt;
&lt;/p&gt;
Listing 7: Example of XHTML + MathML generated by the development version
of MaxTract from a PDF document containing only the formula x^2 + y^2 in its
body
      </p>
      <p>
        During the MathDex project, it became clear that the most time- and
resource-consuming task in building a math search engine and database is the
normalization and conversion of heterogeneous sources [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As shown in Listings 1–6,
MathML can vary slightly depending on how the code was obtained, even
for a trivial formula like x^2 + y^2.
      </p>
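      <p>The variation between the listings above can be reduced mechanically. The following Python sketch rebuilds the hand-made, Matlab and LaTeXML variants of x^2 + y^2 and maps them to one form by dropping xref attributes, unwrapping the semantics element together with its annotation, and unwrapping a single top-level mrow. The function names and the choice of these three rules are assumptions made for illustration; tag names are used without the MathML namespace for brevity.</p>
      <p>
```python
import xml.etree.ElementTree as ET


def el(tag, *children, text=None, **attrib):
    """Build an element; a hypothetical helper used only in this sketch."""
    node = ET.Element(tag, attrib)
    node.text = text
    node.extend(children)
    return node


def clean(node):
    """Copy a subtree, dropping xref attributes and surrounding whitespace."""
    attrib = {k: v for k, v in node.attrib.items() if k != "xref"}
    copy = ET.Element(node.tag, attrib)
    copy.text = (node.text or "").strip() or None
    copy.extend([clean(child) for child in node])
    return copy


def strip_source_noise(math):
    """Return a new math element without source-specific wrappers."""
    body = list(math)
    # LaTeXML wraps the presentation tree in semantics plus an annotation.
    if len(body) == 1 and body[0].tag == "semantics":
        body = [c for c in body[0] if not c.tag.startswith("annotation")]
    # Several tools wrap the whole formula in one top-level mrow.
    if len(body) == 1 and body[0].tag == "mrow":
        body = list(body[0])
    result = ET.Element("math")
    result.extend([clean(child) for child in body])
    return result


def squares():
    """Children of x^2 + y^2 as in Listing 1."""
    return [
        el("msup", el("mi", text="x"), el("mn", text="2")),
        el("mo", text="+"),
        el("msup", el("mi", text="y"), el("mn", text="2")),
    ]


hand_made = el("math", *squares())
matlab = el(
    "math",
    el(
        "mrow",
        el("msup", el("mi", text="x", xref="No1"), el("mn", text="2", xref="No2"), xref="No3"),
        el("mo", text="+"),
        el("msup", el("mi", text="y", xref="No4"), el("mn", text="2", xref="No5"), xref="No6"),
        xref="No7",
    ),
)
latexml = el(
    "math",
    el(
        "semantics",
        el("mrow", *squares()),
        el("annotation", text="x^{2}+y^{2}", encoding="application/x-tex"),
    ),
    alttext="x^{2}+y^{2}",
    display="inline",
)

forms = {
    ET.tostring(strip_source_noise(m), encoding="unicode")
    for m in (hand_made, matlab, latexml)
}
print(len(forms))
```
      </p>
      <p>All three inputs serialize to the same string after the clean-up, so the set holds a single form. The InftyReader output would still differ, because its mathvariant attributes and inner mrow elements are handled by other rules discussed in Section 3.</p>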
      <p>In a DML project, there can be differences in the final MathML encoding
even for semantically and structurally similar formulae, due to the origins of the
MathML from different sources. In Section 3, several more complicated examples
of possible ambiguities in MathML are discussed that have to be normalized to
allow math searches and similarity computation.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Use Cases</title>
      <p>Using our public working demo of the WebMIaS system we discovered several
discrepancies in the form of MathML generated by the real-time TeX to MathML
converter we currently use (Tralics) and by the MathML canonicalizer from
the UMCL library. We employed the UMCL canonicalization module to try to
normalize the users’ MathML input and the MathML produced by the LaTeXML
converter contained in the arXMLiv collection. Then we went through the
Presentation MathML specifications and gathered a list of possible reformatting
rules we could perform.</p>
      <p>The goal is to reduce the possible MathML encodings with the same semantics
and mathematical structure to just one representation. Such a
canonicalized representation is convenient for many applications, as described in
Sections 1 and 2.</p>
      <p>Analyzing the issues of possible inconsistencies and ambiguities of
MathML-encoded formulae raised design and strategy questions. Conceptual decisions
for handling different types of similar constructions and completely different
formulae need to be made.</p>
      <p>More specifically, should we try to keep the MathML compact
and reduce the number of nodes in transformations, or should we try to add
nodes for better disambiguation? Another question is whether our future
canonicalization tool should produce valid MathML according to the MathML schema. Unquestionably,
this feature would be nice to have for many reasons and possible applications,
but it certainly adds more requirements and takes much more effort to design
and implement not only true/false validation, but also functional correctness
validation.</p>
      <p>Below we describe and discuss proposals for transformations that can be
performed with relatively minor difficulty. The list is not complete and is subject
to further evaluation.</p>
      <sec id="sec-3-1">
        <title>3.1 Removing Elements and Attributes</title>
        <p>Many of the MathML elements used in Presentation MathML make little or no
contribution to the semantics of the formula and therefore contribute little
to indexing and searching. These are usually elements that alter the
appearance of formulae in some way: space-like elements such as mspace, mpadded,
mphantom, maligngroup, and malignmark. They may occasionally have some
semantic meaning, but we prefer to canonicalize similar formulae into one
representation rather than risk treating the same formulae as different. Therefore,
these elements are best omitted. The content of the mtext element should be
indexed as normal text before removal.</p>
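        <p>A minimal Python sketch of this removal step follows. It is one possible reading of the text above rather than the final design: mspace, mphantom, maligngroup and malignmark are dropped together with their content, mpadded is replaced by its children (an assumption, since the padded content itself is visible), and mtext content is collected for text indexing before the element is removed. All function and constant names are hypothetical, and tag names omit the MathML namespace.</p>
        <p>
```python
import xml.etree.ElementTree as ET

DROPPED = {"mspace", "mphantom", "maligngroup", "malignmark"}
UNWRAPPED = {"mpadded"}  # assumption: keep the padded content itself


def el(tag, *children, text=None, **attrib):
    """Build an element; a hypothetical helper used only in this sketch."""
    node = ET.Element(tag, attrib)
    node.text = text
    node.extend(children)
    return node


def remove_presentational(node, collected_text):
    """Recursively drop space-like elements below node, in place."""
    new_children = []
    for child in list(node):
        if child.tag in DROPPED:
            continue
        if child.tag == "mtext":
            # Keep the words for ordinary text indexing, then drop the element.
            collected_text.append((child.text or "").strip())
            continue
        remove_presentational(child, collected_text)
        if child.tag in UNWRAPPED:
            new_children.extend(list(child))
        else:
            new_children.append(child)
    node[:] = new_children


numerator = el(
    "mrow",
    el("mi", text="x"), el("mo", text="+"), el("mi", text="y"),
    el("mo", text="+"), el("mi", text="z"),
)
denominator = el(
    "mrow",
    el("mi", text="x"),
    el("mphantom", el("mo", text="+"), el("mi", text="y")),
    el("mo", text="+"),
    el("mi", text="z"),
)
formula = el(
    "math",
    el("mfrac", numerator, denominator),
    el("mspace", width="1em"),
    el("mtext", text=" for all z "),
)

texts = []
remove_presentational(formula, texts)
print([c.tag for c in formula], [c.tag for c in denominator], texts)
```
        </p>
        <p>After the call the formula contains only the mfrac element, its denominator holds the three visible tokens, and the collected text can be passed to the ordinary text indexer.</p>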
        <p>Most element attributes are similarly undesirable. Many are used for
formatting, affecting only the appearance of rendered formulae (for example, the
attributes linebreak and indentalign of the mo element). Others might have
some slight semantic significance, but are very uncommon and usually not very
important; we think these attributes should be removed. However, several
exceptions exist. For instance, the element mfrac is used for fractions but its meaning
changes with the attribute linethickness set to 0, which expresses a binomial
coefficient. The attributes of the element mfenced are also important (see
Listing 9). The attribute mathvariant can also influence formula semantics and
therefore should be preserved in all possible elements. For example, the MIaS
system makes use of this attribute so that hits whose font matches the one
specified by mathvariant are ranked as more relevant.</p>
        <p>&lt;mfrac&gt;
  &lt;mrow&gt;
    &lt;mi&gt; x &lt;/mi&gt;
    &lt;mo&gt; + &lt;/mo&gt;
    &lt;mi&gt; y &lt;/mi&gt;
    &lt;mo&gt; + &lt;/mo&gt;
    &lt;mi&gt; z &lt;/mi&gt;
  &lt;/mrow&gt;
  &lt;mrow&gt;
    &lt;mi&gt; x &lt;/mi&gt;
    &lt;mo&gt; + &lt;/mo&gt;
    &lt;mi&gt; z &lt;/mi&gt;
  &lt;/mrow&gt;
&lt;/mfrac&gt;
&lt;mfrac&gt;
  &lt;mi&gt; a &lt;/mi&gt;
  &lt;mi&gt; b &lt;/mi&gt;
&lt;/mfrac&gt;
Listing 8: Example of &lt;mphantom&gt; omission
&lt;mfrac linethickness="2" bevelled="true"&gt;
  &lt;mi&gt; a &lt;/mi&gt;
  &lt;mi&gt; b &lt;/mi&gt;
&lt;/mfrac&gt;
Listing 9: Example of omission of unnecessary attributes in mfrac</p>
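        <p>The attribute rules of this subsection can be sketched in Python as a whitelist: mathvariant is kept everywhere, the open, close and separators attributes are kept on mfenced, and linethickness is kept on mfrac only when it is 0. The constant and function names are hypothetical, and the exact whitelist is an assumption derived from the exceptions named in the text.</p>
        <p>
```python
import xml.etree.ElementTree as ET

KEEP_EVERYWHERE = {"mathvariant"}
KEEP_PER_ELEMENT = {"mfenced": {"open", "close", "separators"}}


def filter_attributes(node):
    """Remove every attribute that is not whitelisted, in place."""
    for element in node.iter():
        keep = KEEP_EVERYWHERE | KEEP_PER_ELEMENT.get(element.tag, set())
        kept = {k: v for k, v in element.attrib.items() if k in keep}
        # linethickness="0" turns a fraction into a binomial coefficient.
        if element.tag == "mfrac" and element.get("linethickness") == "0":
            kept["linethickness"] = "0"
        element.attrib.clear()
        element.attrib.update(kept)


# The input of Listing 9: a fraction with purely presentational attributes.
fraction = ET.Element("mfrac", {"linethickness": "2", "bevelled": "true"})
ET.SubElement(fraction, "mi", {"mathvariant": "bold", "xref": "No1"}).text = "a"
ET.SubElement(fraction, "mi").text = "b"

# A binomial coefficient must keep its zero line thickness.
binomial = ET.Element("mfrac", {"linethickness": "0"})

filter_attributes(fraction)
filter_attributes(binomial)
print(fraction.attrib, fraction[0].attrib, binomial.attrib)
```
        </p>
        <p>Keeping the rules in two small tables makes it easy to add further exceptions once they are identified.</p>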
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Unifying Fences</title>
        <p>There are two approaches to creating fenced formulae. One is more semantic
and uses the mfenced element with the open, close, and separators attributes
to describe delimiters and separators. The other places fence symbols directly
within mo elements, and the fenced formula is enclosed in the mrow element to
group the elements together. Although the first approach seems to be valid, we
prefer the second one as it is more universal and allows easier conversion. For
example, converting an addition to mfenced with the attribute separators set to + would be
invalid. As shown in Listing 10, mfenced elements are replaced by a more general
mrow element, and fence and separator symbols are added as mo elements. The fenced
elements are further enclosed in an mrow element so that they can be treated as a single
expression when needed. We could also consider unifying the symbols used as
separators/delimiters.</p>
        <p>The mrow element is used for grouping other elements. Its most common use
is to supply the required number of child elements to a parent element (e.g.
mfrac needs two child elements). We can determine unnecessary occurrences of
mrow by comparing the number of its child elements and its siblings with
the number of elements required by the parent element. Parents requiring
only one child element actually accept any number of elements, which are treated
as if they were wrapped in a single inferred mrow element. Hence, the grouping element
is redundant and can be removed. In any case, the impact of the transformations
on any form of processing of the canonicalized notation must be taken into account and
the structure of the formulae must not be violated. For instance, after removing the
mfenced enclosing element we ought to wrap the fenced formula with an mrow
if it is not already wrapped.
&lt;msqrt&gt;
  &lt;mrow&gt;
    &lt;mo&gt; - &lt;/mo&gt;
    &lt;mn&gt; 1 &lt;/mn&gt;
  &lt;/mrow&gt;
&lt;/msqrt&gt;
&lt;msqrt&gt;
  &lt;mo&gt; - &lt;/mo&gt;
  &lt;mn&gt; 1 &lt;/mn&gt;
&lt;/msqrt&gt;
Listing 11: Example of &lt;mrow&gt; removal for √−1 (before and after optimization)</p>
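        <p>Both transformations of this subsection can be sketched in Python. The first function rewrites mfenced into an mrow with explicit mo fences and separators, using the MathML defaults of round parentheses and a comma and repeating the last separator when too few are given. The second removes an mrow that is the only child of a parent with an inferred mrow. The function names are hypothetical, the list of such parents is deliberately partial, and tag names omit the MathML namespace.</p>
        <p>
```python
import xml.etree.ElementTree as ET

# Parents that treat several children as one inferred mrow (a partial list).
INFERRED_MROW_PARENTS = {"math", "msqrt", "mstyle", "merror", "menclose", "mtd"}


def fenced_to_mrow(fenced):
    """Build mrow(mo open, mrow(items and separators), mo close)."""
    open_ = fenced.get("open", "(")
    close = fenced.get("close", ")")
    seps = "".join(fenced.get("separators", ",").split())
    outer = ET.Element("mrow")
    if open_:
        ET.SubElement(outer, "mo").text = open_
    inner = ET.SubElement(outer, "mrow")
    for position, item in enumerate(list(fenced)):
        if position > 0 and seps:
            # Reuse the last separator when the list is too short.
            sep = seps[min(position - 1, len(seps) - 1)]
            ET.SubElement(inner, "mo").text = sep
        inner.append(item)
    if close:
        ET.SubElement(outer, "mo").text = close
    outer.tail = fenced.tail
    return outer


def expand_mfenced(node):
    """Recursively replace every mfenced descendant of node, in place."""
    for index, child in enumerate(list(node)):
        expand_mfenced(child)
        if child.tag == "mfenced":
            node[index] = fenced_to_mrow(child)


def drop_redundant_mrow(node):
    """Recursively unwrap an mrow that duplicates an inferred mrow."""
    for child in node:
        drop_redundant_mrow(child)
    if node.tag in INFERRED_MROW_PARENTS and len(node) == 1 and node[0].tag == "mrow":
        node[:] = list(node[0])


# f(x, y) written with mfenced.
math = ET.Element("math")
ET.SubElement(math, "mi").text = "f"
fenced = ET.SubElement(math, "mfenced")
ET.SubElement(fenced, "mi").text = "x"
ET.SubElement(fenced, "mi").text = "y"
expand_mfenced(math)
print([c.text for c in math[1].iter() if c.text])

# The first form of Listing 11.
root = ET.Element("msqrt")
group = ET.SubElement(root, "mrow")
ET.SubElement(group, "mo").text = "-"
ET.SubElement(group, "mn").text = "1"
drop_redundant_mrow(root)
print([c.tag for c in root])
```
        </p>
        <p>Note that expand_mfenced only replaces descendants, so a formula whose root is itself an mfenced element would need to be wrapped first. The order of the two steps matters: running drop_redundant_mrow after the expansion never removes the inner mrow that keeps the fenced content together, because its parent is an ordinary mrow.</p>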
      </sec>
      <sec id="sec-3-3">
        <title>3.4 Sub-/Superscripts Handling</title>
        <p>The msubsup element, used for attaching a subscript and a superscript to another
element at the same time, is redundant: the same thing can be expressed as
a combination of msub and msup elements. The order of the elements is important.
When both elements are used, we prefer to place msub within msup (see
Listing 12) because a subscript is usually more closely related to the base expression.
A similar problem and solution apply to the triad of elements munder, mover,
and munderover. Both msubsup and munderover can be used for limits of
integration or bounds of summations; therefore, we should use only one canonical
representative.
&lt;msubsup&gt;
  &lt;mi&gt; x &lt;/mi&gt;
  &lt;mn&gt; 1 &lt;/mn&gt;
  &lt;mn&gt; 2 &lt;/mn&gt;
&lt;/msubsup&gt;
&lt;msup&gt;
  &lt;msub&gt;
    &lt;mi&gt; x &lt;/mi&gt;
    &lt;mn&gt; 1 &lt;/mn&gt;
  &lt;/msub&gt;
  &lt;mn&gt; 2 &lt;/mn&gt;
&lt;/msup&gt;
Listing 12: Two ways of expressing x_1^2</p>
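        <p>The rewrite shown in Listing 12 can be sketched in Python as follows. The function replaces msubsup by msup containing msub, and treats munderover analogously with mover containing munder; the latter nesting is an assumption by analogy, as the text only says that the solution is similar. Names are hypothetical and tag names omit the MathML namespace.</p>
        <p>
```python
import xml.etree.ElementTree as ET

# Combined element -> (inner element, outer element).
SPLIT = {"msubsup": ("msub", "msup"), "munderover": ("munder", "mover")}


def split_scripts(node):
    """Recursively replace combined script elements below node, in place."""
    for index, child in enumerate(list(node)):
        split_scripts(child)
        if child.tag in SPLIT and len(child) == 3:
            inner_tag, outer_tag = SPLIT[child.tag]
            base, lower, upper = list(child)
            inner = ET.Element(inner_tag)
            inner.extend([base, lower])
            outer = ET.Element(outer_tag)
            outer.extend([inner, upper])
            outer.tail = child.tail
            node[index] = outer


# The first form of Listing 12.
math = ET.Element("math")
both = ET.SubElement(math, "msubsup")
ET.SubElement(both, "mi").text = "x"
ET.SubElement(both, "mn").text = "1"
ET.SubElement(both, "mn").text = "2"

split_scripts(math)
print(math[0].tag, math[0][0].tag, [c.text for c in math[0].iter() if c.text])
```
        </p>
        <p>Elements with an unexpected number of children are left untouched, so that invalid input is not silently restructured.</p>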
      </sec>
      <sec id="sec-3-4">
        <title>3.5 Applying Functions</title>
        <p>There are many ways to express functions. The entity &amp;#x2061; (function application)
should be used, but we cannot rely on that, so we suggest removing this operator
for the purpose of unification. The opposite approach, adding the function
application operator where it was omitted, could be rather tricky and could
lead to ambiguities. The name of the function should occur in the mi element
but it can also be considered an operator and be placed in the mo element.
The arguments of a function can be fenced with parentheses or an mfenced
element or both. We chose a canonical representation without the entity, with mrow
and parentheses (see Listing 14). Other ambiguities can be caused by different
invisible operators. For example, two identifiers in a subscript with no operator
usually mean multiplication, but they can mean separation too.</p>
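        <p>A Python sketch of the chosen canonical form follows. It removes every mo element that holds the function application character U+2061 and, if the following sibling is not already an mrow enclosed in parentheses, wraps that sibling in one. Restricting the wrapping to arguments that directly follow the removed operator is an assumption of this sketch; function names are hypothetical and tag names omit the MathML namespace.</p>
        <p>
```python
import xml.etree.ElementTree as ET

FUNCTION_APPLICATION = "\u2061"


def is_parenthesized(node):
    """True for an mrow whose first and last children are ( and )."""
    return (
        node.tag == "mrow"
        and len(node) >= 2
        and (node[0].text or "").strip() == "("
        and (node[-1].text or "").strip() == ")"
    )


def canonicalize_application(node):
    """Drop function application operators and parenthesize bare arguments."""
    for child in node:
        canonicalize_application(child)
    result = []
    wrap_next = False
    for child in list(node):
        if child.tag == "mo" and (child.text or "").strip() == FUNCTION_APPLICATION:
            wrap_next = True
            continue
        if wrap_next and not is_parenthesized(child):
            wrapper = ET.Element("mrow")
            ET.SubElement(wrapper, "mo").text = "("
            wrapper.append(child)
            ET.SubElement(wrapper, "mo").text = ")"
            child = wrapper
        wrap_next = False
        result.append(child)
    node[:] = result


# sin x written with an explicit function application operator.
sine = ET.Element("mrow")
ET.SubElement(sine, "mi").text = "sin"
ET.SubElement(sine, "mo").text = FUNCTION_APPLICATION
ET.SubElement(sine, "mi").text = "x"

canonicalize_application(sine)
print([c.tag for c in sine], [c.text for c in sine[1]])
```
        </p>
        <p>An argument that already is a parenthesized mrow is kept as it is, so only the operator disappears in that case.</p>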
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Design Considerations</title>
      <p>The design and implementation decisions for the canonicalization application
depend on the purpose of the new canonicalizer. Even though different tools might use
math content in similar ways, experience shows that one size can hardly
fit all applications. Thus the main design imperatives are
modularity, simplicity, extensibility and flexibility, so that the canonicalizer can
be easily modified when the needs of the applications change. With different data,
the canonicalizer might have to change even for different types of math-aware search.
&lt;mi&gt; f &lt;/mi&gt;
&lt;mo&gt; &amp;#x2061; &lt;/mo&gt;
&lt;mrow&gt;
&lt;mo&gt; ( &lt;/mo&gt;
&lt;mi&gt; x &lt;/mi&gt;
&lt;mo&gt; ) &lt;/mo&gt;
&lt;/mrow&gt;
&lt;mi&gt; sin &lt;/mi&gt;
&lt;mo&gt; &amp;#x2061; &lt;/mo&gt;
&lt;mi&gt; f &lt;/mi&gt;
&lt;mrow&gt;
&lt;mo&gt; ( &lt;/mo&gt;
&lt;mi&gt; x &lt;/mi&gt;
&lt;mo&gt; ) &lt;/mo&gt;
&lt;/mrow&gt;
&lt;mi&gt;sin&lt;/mi&gt;
&lt;mrow&gt;
&lt;mo&gt;(&lt;/mo&gt;
&lt;mi&gt;x&lt;/mi&gt;
&lt;mo&gt;)&lt;/mo&gt;
&lt;/mrow&gt;
Listing 14: Adding parentheses to sine function argument</p>
      <p>The examples in the subsections of the previous section form a set of modules that perform the
necessary MathML tree transformations as recursive procedures on MathML
trees.</p>
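      <p>The modular design can be illustrated with a short Python sketch in which each module is a function that transforms a tree in place and the canonicalizer is an ordered list of such modules. Python is used here only for brevity, the three toy modules and all names are hypothetical, and tag names omit the MathML namespace.</p>
      <p>
```python
import xml.etree.ElementTree as ET


def trim_whitespace(node):
    """Module: strip whitespace around token content."""
    for element in node.iter():
        element.text = (element.text or "").strip() or None
        element.tail = None


def drop_xref(node):
    """Module: remove the xref attribute from every element."""
    for element in node.iter():
        element.attrib.pop("xref", None)


def drop_mspace(node):
    """Module: recursively remove mspace elements."""
    for child in list(node):
        if child.tag == "mspace":
            node.remove(child)
        else:
            drop_mspace(child)


def canonicalize(tree, modules):
    """Apply the configured modules in order and return the tree."""
    for module in modules:
        module(tree)
    return tree


# The configuration is plain data, so each application can choose its own.
PIPELINE = [trim_whitespace, drop_xref, drop_mspace]

formula = ET.Element("math")
mi = ET.SubElement(formula, "mi", {"xref": "No1"})
mi.text = " x "
ET.SubElement(formula, "mspace", {"width": "1em"})
mn = ET.SubElement(formula, "mn")
mn.text = "2"

canonicalize(formula, PIPELINE)
print([(e.tag, e.text, e.attrib) for e in formula])
```
      </p>
      <p>Because the pipeline is just a list, a search application and a similarity application can share modules while enabling, disabling or reordering them independently.</p>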
      <p>
        Given the expected size of the input data set, the efficiency, and in particular the speed,
of the canonicalization application is also a critical parameter: in our MREC [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
corpus there are 168,000,000 formulae to canonicalize. Thus, the use of standard XSL
transformations, for example, does not seem to be appropriate, as the UMCL example
showed.
      </p>
      <p>Another key decision is the handling of invalid input MathML and the question of
producing valid MathML on the output, as mentioned in Section 3.</p>
      <p>As the (Web)MIaS system, as well as other core parts of the EuDML system
(Lucene), uses the Java platform, it seems natural to use Java also for the
implementation of the canonicalization application.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions and Future Work</title>
      <p>We consider MathML canonicalization important for the proper functioning of
several math-aware applications that handle documents in DMLs. We have defined
the problems and enumerated the most important use cases as modules of a newly
designed canonicalizer.</p>
      <p>
        We are currently finishing the design and implementation of the
first version of the application, which will be used for math indexing in the MIaS
system employed in the EuDML project. By evaluating this task we will verify our
design decisions, and we plan to use the tool in other applications working with math fulltext
data (semantic similarity tools such as gensim [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>Acknowledgements This work was partially supported by the European Union
through its Competitiveness and Innovation Programme (Information and
Communication Technologies Policy Support Programme, ‘Open access to scientific
information’, Grant Agreement No. 250503, a project of the European Digital
Mathematics Library, EuDML).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Archambault</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moço</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Overview of the “Universal Maths Conversion Library”</article-title>
          . In: Pruski,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Knops</surname>
          </string-name>
          , H. (eds.) Assistive Technology: From Virtuality to Reality
          <source>: Proceedings of 8th European Conference for the Advancement of Assistive Technology in Europe AAATE</source>
          <year>2005</year>
          , Lille, France. pp.
          <fpage>256</fpage>
          -
          <lpage>260</lpage>
          . IOS Press, Amsterdam, The Netherlands (
          <year>Sep 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Archambault</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moço</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations</article-title>
          . In: Miesenberger,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Klaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Zagler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Karshmer</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . (eds.)
          <source>Computers Helping People with Special Needs, Lecture Notes in Computer Science</source>
          , vol.
          <volume>4061</volume>
          , pp.
          <fpage>1191</fpage>
          -
          <lpage>1198</lpage>
          . Springer Berlin / Heidelberg (
          <year>2006</year>
          ), http://dx.doi.org/10.1007/11788713_
          <fpage>172</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sexton</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorge</surname>
          </string-name>
          , V.:
          <article-title>A linear grammar approach to mathematical formula recognition from PDF</article-title>
          .
          <source>In: Proceedings of the Conferences in Intelligent Computer Mathematics</source>
          ,
          <string-name>
            <surname>CICM</surname>
          </string-name>
          <year>2009</year>
          .
          <article-title>LNAI</article-title>
          , vol.
          <volume>5625</volume>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>216</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sexton</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorge</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Towards reverse engineering of PDF documents</article-title>
          . In: Sojka,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Bouche</surname>
          </string-name>
          , T. (eds.)
          <article-title>Towards a Digital Mathematics Library</article-title>
          ,
          <string-name>
            <surname>DML</surname>
          </string-name>
          <year>2011</year>
          . pp.
          <fpage>65</fpage>
          -
          <lpage>75</lpage>
          . Masaryk University Press, Bertinoro,
          <source>Italy (July</source>
          <year>2011</year>
          ), http://hdl.handle.net/10338.dmlcz/702603
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Borbinha</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouche</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowiński</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Project EuDML – A First Year Demonstration</article-title>
          . In:
          <string-name>
            <surname>Davenport</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farmer</surname>
            ,
            <given-names>W.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urban</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.)
          <source>Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Lecture Notes in Artificial Intelligence, LNAI</source>
          , vol.
          <volume>6824</volume>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>284</lpage>
          . Springer-Verlag, Berlin, Germany (Jul
          <year>2011</year>
          ), http://dx.doi.org/10.1007/978-3-642-22673-1_21
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Grimm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Producing MathML with Tralics</article-title>
          . In: Sojka [13]
          , pp.
          <fpage>105</fpage>
          -
          <lpage>117</lpage>
          , http://dml.cz/dmlcz/702579
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jarmar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Conversion of Mathematical Documents into Braille</article-title>
          .
          <source>Master's thesis</source>
          ,
          Faculty of Informatics (Jan
          <year>2012</year>
          ), https://is.muni.cz/th/172981/fi_m/?lang=en
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Líška</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Růžička</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mravec</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Web Interface and Collection for Mathematical Retrieval</article-title>
          . In:
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouche</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of DML 2011</source>
          . pp.
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          . Masaryk University, Bertinoro, Italy (Jul
          <year>2011</year>
          ), http://www.fi.muni.cz/~sojka/dml-2011-program.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Maplesoft, a division of Waterloo Maple Inc.:
          <article-title>MathML - Maple Help</article-title>
          (Apr
          <year>2012</year>
          ), http://www.maplesoft.com/support/help/Maple/view.aspx?path=MathML
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Munavalli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miner</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>MathFind: A Math-Aware Search Engine</article-title>
          . In:
          <source>Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>735</fpage>
          -
          <lpage>735</lpage>
          . SIGIR '06, ACM, New York, NY, USA (
          <year>2006</year>
          ), http://doi.acm.org/10.1145/1148170.1148348
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automated Classification and Categorization of Mathematical Knowledge</article-title>
          . In:
          <string-name>
            <surname>Autexier</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorge</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suzuki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiedijk</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.)
          <source>Intelligent Computer Mathematics - Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Lecture Notes in Computer Science LNCS/LNAI</source>
          , vol.
          <volume>5144</volume>
          , pp.
          <fpage>543</fpage>
          -
          <lpage>557</lpage>
          . Springer-Verlag, Berlin, Heidelberg (Jul
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          . In:
          <source>Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks</source>
          . pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . ELRA, Valletta, Malta (May
          <year>2010</year>
          ), http://is.muni.cz/publication/884893/en, software available at http://nlp.fi.muni.cz/projekty/gensim
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (ed.):
          <source>Towards a Digital Mathematics Library</source>
          . Masaryk University, Paris, France (Jul
          <year>2010</year>
          ), http://www.fi.muni.cz/~sojka/dml-2010-program.html
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Líška</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Indexing and Searching Mathematics in Digital Libraries</article-title>
          (Mar
          <year>2011</year>
          ), submitted to MKM 2011
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Líška</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Indexing and Searching Mathematics in Digital Libraries - Architecture, Design and Scalability Issues</article-title>
          . In:
          <string-name>
            <surname>Davenport</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farmer</surname>
            ,
            <given-names>W.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urban</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.)
          <source>Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Lecture Notes in Artificial Intelligence, LNAI</source>
          , vol.
          <volume>6824</volume>
          , pp.
          <fpage>228</fpage>
          -
          <lpage>243</lpage>
          . Springer-Verlag, Berlin, Germany (Jul
          <year>2011</year>
          ), http://dx.doi.org/10.1007/978-3-642-22673-1_16
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Stamerjohanns</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlhase</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>David</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Transforming Large Collections of Scientific Publications to XML</article-title>
          .
          <source>Mathematics in Computer Science</source>
          <volume>3</volume>
          ,
          <fpage>299</fpage>
          -
          <lpage>307</lpage>
          (
          <year>2010</year>
          ), http://dx.doi.org/10.1007/s11786-010-0024-7
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Suzuki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tamari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fukuda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uchida</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanahori</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>INFTY - An integrated OCR system for mathematical documents</article-title>
          . In:
          <string-name>
            <surname>Vanoirbeek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roisin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>Proceedings of ACM Symposium on Document Engineering 2003</source>
          . pp.
          <fpage>95</fpage>
          -
          <lpage>104</lpage>
          . ACM, Grenoble, France (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sylwestrzak</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borbinha</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouche</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowiński</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>EuDML - Towards the European Digital Mathematics Library</article-title>
          . In: Sojka [13], pp.
          <fpage>11</fpage>
          -
          <lpage>24</lpage>
          , http://dml.cz/dmlcz/702569
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <article-title>The LaTeXML project: The LaTeXML Developer Portal</article-title>
          (Apr
          <year>2012</year>
          ), https://trac.mathweb.org/LaTeXML/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          The MathWorks, Inc.:
          <article-title>MuPAD - Matlab</article-title>
          (May
          <year>2012</year>
          ), http://www.mathworks.com/discovery/mupad.html
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Watt</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          :
          <article-title>Mathematical Document Classification via Symbol Frequency Analysis</article-title>
          . In:
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (ed.)
          <source>Towards Digital Mathematics Library - Proceedings of DML 2008</source>
          . pp.
          <fpage>29</fpage>
          -
          <lpage>40</lpage>
          . Masaryk University, Birmingham, UK (Jul
          <year>2008</year>
          ), http://www.fi.muni.cz/~sojka/dml-2008-program.xhtml
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. Wolfram:
          <article-title>Mathematica Import/Export Format: MathML</article-title>
          (Apr
          <year>2012</year>
          ), http://reference.wolfram.com/mathematica/ref/format/MathML.html
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          Wolfram Alpha LLC:
          <article-title>Wolfram Alpha</article-title>
          (Apr
          <year>2012</year>
          ), http://www.wolframalpha.com/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>