<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takuto Asakura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>SOKENDAI</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Converting Science, Technology, Engineering, and Mathematics (STEM) documents to formal expressions would have a large impact on both academia and industry. It would enable us to construct databases of mathematical knowledge, search for formulae, and develop systems that generate executable code automatically. However, the conversion is an exceedingly ambitious goal. Mathematical expressions are commonly used in scientific communication in numerous fields such as mathematics and physics, and in many cases they express the key ideas of STEM documents. Despite their importance, formulae cannot be understood in isolation: formulae and text are complementary to each other. Thus, deep synthetic analyses of natural language and mathematical expressions are necessary. To date, considerable effort has gone into developing Natural Language Processing (NLP) techniques, including semantic parsing [4], but their targets are mostly 'general' texts. Consequently, conventional NLP techniques offer only limited support for formulae and for the numerous linguistic phenomena specific to STEM documents [3]. Meanwhile, the semantics of mathematical expressions has also been investigated in depth, for example in logical theories and in the MathML specification [1]. However, a large gap remains between formal expressions such as first-order logic and the actual formulae that appear in natural language texts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Much work remains before STEM documents can be converted to a computational
form (Figure 1). We first focus on two foundational parts of the synthetic analyses. The
first is token-level analysis of formulae, whose main task is associating formula tokens with
mathematical objects and text fragments (Section 2.1). This is a primary step for the conversion, yet it
remains almost untouched. The second is the morphology of mathematical expressions and a semantics covering both
formulae and text (Section 2.2). Studying the underlying theories is essential for understanding the structure of
STEM documents deeply and for working towards practical applications through a bottom-up approach.</p>
      <p>Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure 1. Labels recovered from the figure: Mathematical expressions; Natural language; Applications; Formula analyses; Sentence analyses; Token-level analyses; Word-level analyses.]</p>
      <sec id="sec-1-4">
        <title>Morphology &amp; Semantics</title>
        <p>difficulties often appear in formulae. Taking (1) as a representative example: in the first chapter alone of the
book Pattern Recognition and Machine Learning (PRML) [2], the character y (the letter ‘y’ in bold roman) is used with
several meanings, including a function, vectors, and a value (Table 1).</p>
        <p>The other initial step in understanding STEM documents is connecting text fragments to
the corresponding mathematical objects. Our hypothesis is that general NLP approaches such as
dependency parsing are more or less applicable to this step. Of course, some tuning for STEM documents will be required,
and this process might need to be carried out interactively with the mathematical object detection for formulae.</p>
        <sec id="sec-1-4-1">
          <label>2.2</label>
          <title>Semantics and Morphology</title>
          <p>The semantics of natural language and of mathematical expressions have been studied separately. However, to
understand STEM documents, it is important to investigate a synthetic semantics covering both text and formulae.</p>
          <p>Morphology has been studied extensively for natural languages, but much less for formulae. In fact,
in terms of morphology, words also exist in formulae. For instance, the token M is a word in “Matrix M”, but M
is not a word in “An entry M<sub>i,j</sub>” (here M<sub>i,j</sub> is the word). Unlike morphemes in natural language, tokens in formulae
do not have lexical categories, but some symbols (e.g., parentheses and the equals sign) and positional information
(e.g., superscripts and subscripts) have typical usages.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Completed and Remaining Research</title>
      <p>To begin our research, we simplified the detection task described in Section 2.1. Specifically,
we are annotating a number of research papers in the following manner:</p>
      <p>1. Detecting minimal groups of tokens (we call them chunks), each of which refers to a mathematical object
(chunking).</p>
      <p>2. Categorizing chunks by the mathematical objects they refer to.</p>
      <p>This annotation (we call it the pilot annotation) is the fundamental process for creating the first gold dataset for
associating tokens with mathematical objects. The annotated data will also be helpful for investigating the
morphology of mathematical expressions.</p>
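      <p>The two annotation steps can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the annotation format used in this work: the class Chunk, the helper functions, the token sequence, and the category labels are all hypothetical. A chunk is stored as a span over the token sequence of a formula, and each chunk carries one category.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A minimal group of formula tokens that refers to one mathematical object."""
    start: int      # index of the first token of the chunk
    end: int        # index one past the last token of the chunk
    category: str   # hypothetical category label (step 2 of the annotation)


def surface(tokens, chunk):
    """Return the surface string covered by a chunk."""
    return "".join(tokens[chunk.start:chunk.end])


def categories_by_surface(tokens, chunks):
    """Group the assigned categories by the surface string of each chunk."""
    grouped = defaultdict(set)
    for chunk in chunks:
        grouped[surface(tokens, chunk)].add(chunk.category)
    return dict(grouped)


# Toy formula "y(x) = M_i,j"; tokens and labels are made up for illustration.
tokens = ["y", "(", "x", ")", "=", "M", "_", "i", ",", "j"]
chunks = [
    Chunk(0, 1, "function"),       # y
    Chunk(2, 3, "variable"),       # x
    Chunk(5, 10, "matrix entry"),  # M_i,j is annotated as a single chunk
]
result = categories_by_surface(tokens, chunks)
```
      </preformat>
      <p>Storing chunks as spans rather than strings keeps the link to the position in the formula, which matters when the same surface form is assigned different categories at different places.</p>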
      <p>In other words, we defined a classification task before annotating descriptions for tokens. Since there are
many ways to describe a mathematical object, free-form descriptions are hard to keep consistent, whereas classification
can be done more coherently through the pilot annotation. Moreover, we expect classification to be easier to automate than
generating descriptions, which makes it a more suitable first attempt.</p>
      <p>Apart from the pilot annotation, all the work needed to achieve our goal remains to be done. As the
next step, we plan to automate the annotation process using features such as appositive nouns and
syntactic information in formulae. At the same time, we have to decide on the representation of mathematical objects. For
now, we can say that every mathematical object should have a description and some attributes such as types
(e.g., int and float). Which attributes are necessary and sufficient is not yet clear; we will determine this after
trying the annotation on several documents.</p>
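      <p>As a rough sketch of what such a representation might look like, the following assumes that a mathematical object is a description plus a free-form attribute dictionary, and that each token occurrence points to one object. The names MathObject, TokenOccurrence, and meanings_of, as well as the example descriptions and attribute keys, are hypothetical; the set of attributes is deliberately left open because it is not yet decided.</p>
      <preformat preformat-type="code">
```python
from dataclasses import dataclass, field


@dataclass
class MathObject:
    """A mathematical object: a description plus open-ended attributes."""
    description: str
    attributes: dict = field(default_factory=dict)  # e.g. {"type": "float"}


@dataclass
class TokenOccurrence:
    """One occurrence of a formula token, linked to the object it denotes."""
    surface: str          # the token as written, e.g. "y"
    position: int         # hypothetical index of the occurrence in the document
    referent: MathObject


def meanings_of(occurrences, surface):
    """List the distinct descriptions of a token, in order of first appearance."""
    seen = []
    for occ in occurrences:
        text = occ.referent.description
        if occ.surface == surface and text not in seen:
            seen.append(text)
    return seen


# Made-up objects and occurrences, showing one token with two meanings.
prediction = MathObject("prediction function", {"type": "function"})
target_value = MathObject("observed target value", {"type": "float"})
occurrences = [
    TokenOccurrence("y", 0, prediction),
    TokenOccurrence("t", 1, target_value),
    TokenOccurrence("y", 2, target_value),
    TokenOccurrence("y", 3, prediction),
]
meanings = meanings_of(occurrences, "y")
```
      </preformat>
      <p>Keeping the attributes in a dictionary rather than in fixed fields allows new attributes to be added as the annotation of further documents reveals which ones are needed.</p>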
    </sec>
    <sec id="sec-3">
      <title>Publication Plans and Evaluation Plans</title>
      <p>Currently, we are creating a new language resource through the pilot annotation, and we plan to publish it
for the language resources community. Further in the future, we will develop algorithms to automate
mathematical object detection; this work suits the NLP and digital mathematical library communities,
including CICM. The analyses of the underlying morphology and semantics belong more to computational
linguistics.</p>
      <p>For the initial dataset, it is preferable to establish agreement among a few experts if possible. Subsequent progress
in developing algorithms and in analysing linguistic phenomena should be evaluated against our hand-made gold
datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ron</given-names>
            <surname>Ausbrooks</surname>
          </string-name>
          et al.
          <source>Mathematical Markup Language (MathML) 3.0 Specification</source>
          .
          <publisher-name>World Wide Web Consortium (W3C)</publisher-name>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christopher M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <source>Pattern Recognition and Machine Learning</source>
          .
          <publisher-name>Springer</publisher-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mihnea</given-names>
            <surname>Iancu</surname>
          </string-name>
          . “
          <article-title>Co-Representing Structure and Meaning of Mathematical Documents</article-title>
          ”. In:
          <source>Sprache und Datenverarbeitung, International Journal for Language Data Processing</source>
          <volume>38</volume>
          .
          <issue>2</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Siva</given-names>
            <surname>Reddy</surname>
          </string-name>
          et al. “
          <article-title>Transforming Dependency Structures to Logical Forms for Semantic Parsing</article-title>
          ”. In:
          <source>Transactions of the Association for Computational Linguistics</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>