<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>WIP: Creating a Database of Definitions From Large Mathematical Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Berlioz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pittsburgh</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>2</volume>
      <fpage>8</fpage>
      <lpage>12</lpage>
      <abstract>
<p>We propose a method to gather large numbers of definitions from mathematical documents available online. Recent work indicates that text classification algorithms can attain excellent accuracy at determining whether a given paragraph is a definition. These algorithms are trained on large math corpora available online, such as the arXiv website. The LaTeX source code of these documents is first converted into a more structured format like XML or HTML with the software package LaTeXML. The training data for the classifier is then obtained by searching for the definitions that the author labeled with a LaTeX macro. The second phase of the system consists of extracting the term being defined (the definiendum) from each definition. This task is performed by a Named Entity Recognition (NER) model trained on data from websites with mathematical content. The data is finally organized according to several different properties, like semantic similarity and content dependency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[Figure: a sample of definienda extracted by the system: 'Bott-Danilov-Steenbrink vanishing theorem', 'p-good cover', 'non-zerodivisor', 'Frobenius submanifold', 'holomorphic one-dimensional foliations', 'universal expansion', 'non-toric purely log-terminal blow-up', 'cubical algebra', 'virtual bundle', 'symplectic structure', 'Banach manifold', 'Loewy filtration', 'graded-commutative product', 'local Harbourne constant', 'donnés par', 'Berenstein-Zelevinsky triangles', 'DL-gallery', 'smooth lifting', 'stalkwise fibration', '4-dimensional quadric']</p>
      <p>
The LaTeX source from the arXiv has to be further processed before it becomes useful. This is done with the
LaTeXML software package [8]. LaTeXML first converts the TeX source to XML and can optionally produce HTML
using an additional script. For the purpose of identifying the definitions labeled by the author, the XML output
is sufficient.</p>
      <p>&lt;theorem class="ltx_theorem_definition" inlist="thm theorem:definition" xml:id="Thmdefinition1"&gt;
&lt;tags&gt;
&lt;tag&gt;Definition 1&lt;/tag&gt;
&lt;tag role="refnum"&gt;1&lt;/tag&gt;
&lt;tag role="typerefnum"&gt;Definition 1&lt;/tag&gt;
&lt;/tags&gt;
&lt;title class="ltx_runin"&gt;&lt;tag&gt;&lt;text font="bold"&gt;Definition 1&lt;/text&gt;&lt;/tag&gt;.&lt;/title&gt;
&lt;para xml:id="Thmdefinition1.p1"&gt;
&lt;p class="ltx_emph"&gt;&lt;text font="italic"&gt;Let &lt;Math mode="inline" tex="k" text="k" xml:id="Thmdefinition1.p1.m1"&gt;
&lt;XMath&gt;</p>
      <p>&lt;XMTok role="UNKNOWN"&gt;k&lt;/XMTok&gt;
&lt;/XMath&gt;</p>
      <p>
Recent work indicates that well-known text classification algorithms [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] can have excellent accuracy at
determining whether a given paragraph is in fact a definition. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for example, a supervised learning method is
first trained using word embeddings. These word embeddings are created by feeding the contents of the arXiv articles
into an embedding algorithm like GloVe [11]. This has already been implemented and is available in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our
system does not yet use word embeddings for its classification; adding them is one of the main features we plan
in order to improve the classifier.
      </p>
      <p>As training data for the classifier, we use the passages of articles that the author labeled as definitions by
placing them in certain LaTeX macro environments. These macros are normally defined in the
preamble of the document using the \newtheorem macro. LaTeXML resolves the user-defined macros and labels
the corresponding XML tag in the output file, as in Figure 1.</p>
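      <p>As an illustrative sketch (not the project's actual code), this harvesting step can be approximated with Python's standard ElementTree library. The class name ltx_theorem_definition comes from the LaTeXML output above; the function name and the toy document are hypothetical:</p>

```python
# Sketch: collect author-labeled definitions from LaTeXML-style XML.
# The class name ltx_theorem_definition follows the LaTeXML schema;
# the toy document below stands in for a converted arXiv article.
import xml.etree.ElementTree as ET

def extract_definitions(root):
    """Return the plain text of every element marked as a definition."""
    found = []
    for elem in root.iter():
        if 'ltx_theorem_definition' in elem.get('class', ''):
            found.append(''.join(elem.itertext()).strip())
    return found

# Build a toy stand-in document programmatically.
root = ET.Element('document')
thm = ET.SubElement(root, 'theorem', {'class': 'ltx_theorem_definition'})
para = ET.SubElement(thm, 'para')
ET.SubElement(para, 'p').text = 'A Banach space is a complete normed vector space.'
lem = ET.SubElement(root, 'theorem', {'class': 'ltx_theorem_lemma'})
ET.SubElement(ET.SubElement(lem, 'para'), 'p').text = 'Every field is a ring.'

print(extract_definitions(root))
# → ['A Banach space is a complete normed vector space.']
```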
      <p>To produce the negative examples, we randomly sample paragraphs from the article and assume
they are not definitions. This introduces some noise into the training set, because some of the sampled
paragraphs inevitably do contain definitions.</p>
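      <p>A minimal sketch of this sampling scheme (the function name and inline data are hypothetical; the real system works on paragraphs extracted from the XML):</p>

```python
# Sketch: build a labeled training set by pairing the author-labeled
# definitions with randomly sampled paragraphs assumed to be negatives.
import random

def build_training_set(def_paragraphs, all_paragraphs, seed=0):
    rng = random.Random(seed)
    positives = set(def_paragraphs)
    negatives = [p for p in all_paragraphs if p not in positives]
    # Sample as many (assumed) negatives as there are positives; some of
    # these may in fact be definitions, which introduces label noise.
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    data = [(p, True) for p in def_paragraphs] + [(p, False) for p in sampled]
    rng.shuffle(data)
    return data

data = build_training_set(
    ['a monoid is a set with an associative operation and a unit.'],
    ['a monoid is a set with an associative operation and a unit.',
     'we defer the proof to the appendix.',
     'the converse follows by symmetry.'])
print(len(data))  # → 2
```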
      <p>We have performed successful experiments using common general-purpose algorithms implemented in the
scikit-learn Python library [10]. These experiments were confirmed by the results shown on the website
https://corpora.mathweb.org/classify_paragraph. Table 2 shows the result of the classifier on some
simple examples.</p>
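      <p>The experimental setup can be sketched with scikit-learn as follows; the tiny inline dataset and the pipeline choices (TF-IDF features with a linear SVM) are illustrative assumptions, not the exact configuration used:</p>

```python
# Sketch: a definition-vs-non-definition paragraph classifier built
# from standard scikit-learn components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    'a banach space is defined as a complete normed vector space.',
    'a triangle is equilateral if and only if all its sides are equal.',
    'we now prove the main theorem of this section.',
    'the result follows from a standard compactness argument.',
]
train_labels = [True, True, False, False]  # True = definition

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

pred = clf.predict(['a group is called abelian if its operation is commutative.'])
print(pred)
```

      <p>A linear kernel over TF-IDF n-grams is a common baseline for this kind of paragraph classification; an embedding-based or fastText model can later replace it without changing the surrounding pipeline.</p>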
      <p>Text classifiers normally take each paragraph of an article and output an estimate of the probability that it is
a definition. Figure 2 presents the basic performance metrics of some of the classifiers implemented
in the scikit-learn library. The Support Vector Classifier was observed to have the best performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Input to the Classifier</title>
      <p>a banach space is defined as a complete vector space.
This is not a definition honestly. even if it includes technical words like scheme and cohomology
There is no real reason as to why this classifier is so good.
a triangle is equilateral if and only if all its sides are the same length.</p>
      <p>[Table 2: the classifier's True/False predictions on the inputs above.]</p>
    </sec>
    <sec id="sec-4">
      <title>Extracting the Definiendum</title>
      <p>A more detailed view of the result is pictured in Table 3. In the future we plan to use the fastText method [6], which has
the best tradeoff between classification speed and accuracy.</p>
      <p>Classifier performance per class:
nondefs:      precision 0.73, recall 0.91, F1-score 0.81, support 2,217
definitions:  precision 0.95, recall 0.84, F1-score 0.89, support 4,661
micro avg:    precision 0.86, recall 0.86, F1-score 0.86
macro avg:    precision 0.84, recall 0.87, F1-score 0.85
weighted avg: precision 0.88, recall 0.86, F1-score 0.87</p>
      <p>
After determining the definitions in the text, the system has to find the term that is being defined
in each definition. It is assumed that the definiendum is one or more adjacent words in the definition. This
task can be interpreted as a Named Entity Recognition (NER) problem. Several different techniques have been
developed to deal with it, as it is considered one of the most important subtasks of Information Extraction [9].</p>
      <p>For our first approach to this problem, we used the ChunkParserI interface from the NLTK library [7]. This
module uses a supervised learning algorithm that is trained on examples of definitions tagged with part-of-speech
(POS) and IOB tags. Each word in the definition is tagged with the token O for Outside, B-DFNDUM for the
beginning of a definiendum, and I-DFNDUM for the inside of a definiendum. Figure 3 specifies the order in which
these tags are allowed to appear. The POS tags are obtained using the pretrained model included in the NLTK library.</p>
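      <p>The tag-sequence constraint pictured in Figure 3 can be expressed as a small check (a hypothetical helper, not part of NLTK):</p>

```python
# Sketch: validate an IOB tag sequence for the definiendum scheme above.
# The only constraint is that I-DFNDUM may follow only B-DFNDUM or
# another I-DFNDUM; O and B-DFNDUM may appear anywhere.
VALID_TAGS = {'O', 'B-DFNDUM', 'I-DFNDUM'}

def is_valid_iob(tags):
    prev = 'O'  # the start state behaves like Outside
    for tag in tags:
        if tag not in VALID_TAGS:
            return False
        if tag == 'I-DFNDUM' and prev not in ('B-DFNDUM', 'I-DFNDUM'):
            return False
        prev = tag
    return True

print(is_valid_iob(['O', 'B-DFNDUM', 'I-DFNDUM', 'O']))  # → True
print(is_valid_iob(['O', 'I-DFNDUM']))                   # → False
```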
      <p>After the training is done, the model tries to predict the IOB tags. In table 4 an example of a successful
identification of the definiendum is shown.</p>
      <p>[Figure 3: state diagram of the allowed IOB tag transitions between the start state, O, B, and I.]</p>
      <p>To obtain the tagged text, the whole body of text from Wikipedia was used. The examples of definitions were
obtained by filtering the articles with the following two properties:
• The article has a section whose heading contains the word definition.</p>
      <p>• The title of the article must appear at least once in this section.</p>
      <p>These sections were assumed to be definitions, and the title of the article to which they belong was assumed to
be the definiendum. Only 5,229 articles were found matching these criteria (February 2019) out of the more than
6 million articles in the English Wikipedia. The dataset was split into training and test data; the results are
shown in Figure 5. The results on the definitions found in the algebraic geometry (math.AG) articles uploaded to
the arXiv in 2015 are pictured in Figure 4.</p>
      <p>Several difficulties were observed with this approach. For instance, many of the articles from Wikipedia are
about topics completely unrelated to mathematics, and some artifacts remain in the text after stripping all the
wiki markup from it.</p>
      <p>IOB Accuracy: 91.1%
Precision: 31.5%
Recall: 67.6%
F-Measure: 43.0%</p>
    </sec>
    <sec id="sec-5">
      <title>Input to the Classifier</title>
      <p>Let n ≥ 1. Recall that the lexicographic order ≤_l on N^n is defined by v =
(v_1, …, v_n) ≤_l (w_1, …, w_n) = w if and only if either v = w or there is some i,
1 ≤ i ≤ n, with v_j = w_j for all j in the range 1 ≤ j &lt; i, and v_i &lt; w_i. Then
≤_l is an admissible order on N^n in the sense of \cite{BWK:98}. Indeed N^n,
together with componentwise addition and ≤_l, forms a totally ordered abelian
monoid. The lexicographic order ≤_l can be defined similarly on Z^n, forming a
totally ordered abelian group.
(Upper semicontinuity of valuation) Let f be a nonzero element of k[x_1, …, x_n]
and let a ∈ k^n. Then there exists a neighbourhood V ⊆ k^n of a such that for
all b ∈ V, v_b(f) ≤_l v_a(f).</p>
      <p>This claim concerns valuation-invariant lifting in relation to PL(A): it asserts
that the condition, ‘each element of PL(A) is valuation-invariant in S’, is
sufficient for an A-valuation-invariant stack in R^n to exist over S.
Let f/g be a nonzero element of K, let U ⊆ k^n be an open set throughout
which g ≠ 0, and let a ∈ U. Then there exists a neighbourhood V ⊆ U of a
such that for all b ∈ V, ord_b(f/g) ≤ ord_a(f/g).</p>
    </sec>
    <sec id="sec-6">
      <title>Result</title>
      <p>[Table 3: the classifier's results on the inputs above were True Positive, False Positive, True Negative, and False Positive.]</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion</title>
      <p>The main objective of this article is to showcase the feasibility of a system that can search for definitions and
important terms in large bodies of mathematical text. Such a system needs extremely good classification
performance to keep errors from propagating to the NER stage, while remaining fast enough to tackle large
corpora such as all the mathematical articles on the arXiv. Considering the performance of the state-of-the-art
methods for text classification and NER available today, and after observing the performance of the current
prototype, we believe that such a system is possible.</p>
      <p>The next step toward more sophisticated methods is to use word embeddings or language models; methods that
make use of these achieve better performance in both the classification and NER tasks.</p>
      <p>
        To further improve the performance of the NER subtask, we also plan to increase the amount of
training data. The technique we used for the Wikipedia data can be adapted to other websites that host
similar content, like The Stacks Project (https://stacks.math.columbia.edu/) and the Groupprops subwiki
(https://groupprops.subwiki.org). Additionally, applying domain adaptation methods might help to improve
performance in case the labeled data deviates significantly from the unlabeled data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Réjean Ducharme, Pascal Vincent, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Jauvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>3</volume>
          (Feb):
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tao</given-names>
            <surname>Chen</surname>
          </string-name>
          , Ruifeng Xu,
          <string-name>
            <given-names>Yulan</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Improving sentiment analysis via sentence type classification using bilstm-crf and cnn</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>72</volume>
          :
          <fpage>221</fpage>
          -
          <lpage>230</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Deyan</given-names>
            <surname>Ginev</surname>
          </string-name>
          .
          <source>arXMLiv:08.2018 dataset, an HTML5 conversion of arXiv.org</source>
          ,
          <year>2018</year>
          . SIGMathLing - Special Interest Group on Math Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Deyan</given-names>
            <surname>Ginev</surname>
          </string-name>
          .
          <article-title>A web demo for scientific paragraph classification</article-title>
          . https://github.com/dginev/web-scipara-demo,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jing</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Instance weighting for domain adaptation in nlp</article-title>
          .
          <source>In Proceedings of the 45th annual meeting of the association of computational linguistics</source>
          , pages
          <fpage>264</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>