<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>WIP: Creating a Database of Definitions From Large Mathematical Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luis Berlioz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pittsburgh</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>2</volume>
      <fpage>8</fpage>
      <lpage>12</lpage>
      <abstract>
<p>We propose a method to gather large numbers of definitions from mathematical documents available online. Recent work indicates that text classification algorithms can attain excellent accuracy at determining whether a given paragraph is a definition. These algorithms are trained on large math corpora available online, such as the arXiv website. The LaTeX source code of these documents is first converted into a more structured format like XML or HTML with the software package LaTeXML. The training data for the classifier is then obtained by searching for the definitions that the author labeled with a LaTeX macro. The second phase of the system consists of extracting the term being defined (the definiendum) from each definition. This task is performed by a Named Entity Recognition (NER) model trained on data from websites with mathematical content. The data is finally organized according to several different properties, like semantic similarity and content dependency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[Figure: a sample of definienda extracted by the system: 'Bott-Danilov-Steenbrink vanishing theorem', 'p-good cover', 'non-zerodivisor', 'Frobenius submanifold', 'holomorphic one-dimensional foliations', 'universal expansion', 'non-toric purely log-terminal blow-up', 'cubical algebra', 'virtual bundle', 'symplectic structure', 'Banach manifold', 'Loewy filtration', 'graded-commutative product', 'local Harbourne constant', 'donnés par', 'Berenstein-Zelevinsky triangles', 'DL-gallery', 'smooth lifting', 'stalkwise fibration', '4-dimensional quadric']</p>
      <p>
The LaTeX source from the arXiv has to be further processed before it becomes useful. This is done with the
LaTeXML software package [8]. LaTeXML first converts the TeX source to XML and can optionally produce HTML
using an additional script. For the purpose of identifying the definitions labeled by the author, the XML output
is sufficient.</p>
      <p>&lt;theorem class="ltx_theorem_definition" inlist="thm theorem:definition" xml:id="Thmdefinition1"&gt;
&lt;tags&gt;
&lt;tag&gt;Definition 1&lt;/tag&gt;
&lt;tag role="refnum"&gt;1&lt;/tag&gt;
&lt;tag role="typerefnum"&gt;Definition 1&lt;/tag&gt;
&lt;/tags&gt;
&lt;title class="ltx_runin"&gt;&lt;tag&gt;&lt;text font="bold"&gt;Definition 1&lt;/text&gt;&lt;/tag&gt;.&lt;/title&gt;
&lt;para xml:id="Thmdefinition1.p1"&gt;
&lt;p class="ltx_emph"&gt;&lt;text font="italic"&gt;Let &lt;Math mode="inline" tex="k" text="k" xml:id="Thmdefinition1.p1.m1"&gt;
&lt;XMath&gt;</p>
      <p>&lt;XMTok role="UNKNOWN"&gt;k&lt;/XMTok&gt;
&lt;/XMath&gt;</p>
      <p>
Recent work indicates that well-known text classification algorithms [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] can have excellent accuracy at
determining whether a given paragraph is in fact a definition. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for example, a supervised learning method is
first trained using word embeddings. These word embeddings are created by feeding the contents of the arXiv articles
into an embedding algorithm like GloVe [11]. This has already been implemented and is available in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our
system does not yet use word embeddings for its classification; adding them is one of the main features we plan
in order to improve the classifier.
      </p>
      <p>As training data for the classifier, we use the passages of articles that the author labeled as definitions by
placing them in certain LaTeX macro environments. These macros are normally defined in the
preamble of the document using the \newtheorem macro. LaTeXML resolves the user-defined macros and labels
the corresponding XML tag in the output file, as in Figure 1.</p>
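      <p>As an illustrative sketch (not the project's actual code), this harvesting step can be approximated with Python's standard ElementTree library. The class name ltx_theorem_definition comes from the LaTeXML output above; the function name and the toy document are hypothetical:</p>

```python
# Sketch: collect author-labeled definitions from LaTeXML-style XML.
# The class name ltx_theorem_definition follows the LaTeXML schema;
# the toy document below stands in for a converted arXiv article.
import xml.etree.ElementTree as ET

def extract_definitions(root):
    """Return the plain text of every element marked as a definition."""
    found = []
    for elem in root.iter():
        if 'ltx_theorem_definition' in elem.get('class', ''):
            found.append(''.join(elem.itertext()).strip())
    return found

# Build a toy stand-in document programmatically.
root = ET.Element('document')
thm = ET.SubElement(root, 'theorem', {'class': 'ltx_theorem_definition'})
para = ET.SubElement(thm, 'para')
ET.SubElement(para, 'p').text = 'A Banach space is a complete normed vector space.'
lem = ET.SubElement(root, 'theorem', {'class': 'ltx_theorem_lemma'})
ET.SubElement(ET.SubElement(lem, 'para'), 'p').text = 'Every field is a ring.'

print(extract_definitions(root))
# → ['A Banach space is a complete normed vector space.']
```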
      <p>To produce the negative examples, we randomly sample paragraphs from the article and assume
they are not definitions. This introduces some noise into the training set, because some of the sampled
paragraphs inevitably do contain definitions.</p>
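      <p>A minimal sketch of this sampling scheme (the function name and inline data are hypothetical; the real system works on paragraphs extracted from the XML):</p>

```python
# Sketch: build a labeled training set by pairing the author-labeled
# definitions with randomly sampled paragraphs assumed to be negatives.
import random

def build_training_set(def_paragraphs, all_paragraphs, seed=0):
    rng = random.Random(seed)
    positives = set(def_paragraphs)
    negatives = [p for p in all_paragraphs if p not in positives]
    # Sample as many (assumed) negatives as there are positives; some of
    # these may in fact be definitions, which introduces label noise.
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    data = [(p, True) for p in def_paragraphs] + [(p, False) for p in sampled]
    rng.shuffle(data)
    return data

data = build_training_set(
    ['a monoid is a set with an associative operation and a unit.'],
    ['a monoid is a set with an associative operation and a unit.',
     'we defer the proof to the appendix.',
     'the converse follows by symmetry.'])
print(len(data))  # → 2
```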
      <p>We have performed successful experiments using common general-purpose algorithms implemented in the
scikit-learn Python library [10]. These experiments were confirmed by the results shown on the website
https://corpora.mathweb.org/classify_paragraph. Table 2 shows the result of the classifier on some
simple examples.</p>
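      <p>The experimental setup can be sketched with scikit-learn as follows; the tiny inline dataset and the pipeline choices (TF-IDF features with a linear SVM) are illustrative assumptions, not the exact configuration used:</p>

```python
# Sketch: a definition-vs-non-definition paragraph classifier built
# from standard scikit-learn components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    'a banach space is defined as a complete normed vector space.',
    'a triangle is equilateral if and only if all its sides are equal.',
    'we now prove the main theorem of this section.',
    'the result follows from a standard compactness argument.',
]
train_labels = [True, True, False, False]  # True = definition

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

pred = clf.predict(['a group is called abelian if its operation is commutative.'])
print(pred)
```

      <p>A linear kernel over TF-IDF n-grams is a common baseline for this kind of paragraph classification; an embedding-based or fastText model can later replace it without changing the surrounding pipeline.</p>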
      <p>Text classifiers normally take each paragraph of an article and output an estimate of the probability that it is
a definition. Figure 2 presents the basic performance metrics of some of the classifiers implemented
in the scikit-learn library. The Support Vector Classifier was observed to have the best performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Input to the Classifier</title>
      <p>a banach space is defined as a complete vector space.
This is not a definition honestly. even if it includes technical words like scheme and cohomology
There is no real reason as to why this classifier is so good.
a triangle is equilateral if and only if all its sides are the same length.</p>
      <p>[Table 2: the classifier's True/False predictions on the inputs above.]</p>
    </sec>
    <sec id="sec-4">
      <title>Extracting the Definiendum</title>
      <p>A more detailed view of the result is pictured in Table 3. In the future we plan to use the fastText method [6], which has
the best tradeoff between classification speed and accuracy.</p>
      <p>Classifier performance per class:
nondefs:      precision 0.73, recall 0.91, F1-score 0.81, support 2,217
definitions:  precision 0.95, recall 0.84, F1-score 0.89, support 4,661
micro avg:    precision 0.86, recall 0.86, F1-score 0.86
macro avg:    precision 0.84, recall 0.87, F1-score 0.85
weighted avg: precision 0.88, recall 0.86, F1-score 0.87</p>
      <p>
After determining the definitions in the text, the system has to find the term that is being defined
in each definition. It is assumed that the definiendum is one or more adjacent words in the definition. This
task can be interpreted as a Named Entity Recognition (NER) problem. Several different techniques have been
developed to deal with it, as it is considered one of the most important subtasks of Information Extraction [9].</p>
      <p>For our first approach to this problem, we used the ChunkParserI interface from the NLTK library [7]. This
module uses a supervised learning algorithm that is trained on examples of definitions tagged with part-of-speech
(POS) and IOB tags. Each word in the definition is tagged with the token O for Outside, B-DFNDUM for the
beginning of a definiendum, and I-DFNDUM for the inside of a definiendum. Figure 3 specifies the order in which
these tags are allowed to appear. The POS tags are obtained using the pretrained model included in the NLTK library.</p>
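      <p>The tag-sequence constraint pictured in Figure 3 can be expressed as a small check (a hypothetical helper, not part of NLTK):</p>

```python
# Sketch: validate an IOB tag sequence for the definiendum scheme above.
# The only constraint is that I-DFNDUM may follow only B-DFNDUM or
# another I-DFNDUM; O and B-DFNDUM may appear anywhere.
VALID_TAGS = {'O', 'B-DFNDUM', 'I-DFNDUM'}

def is_valid_iob(tags):
    prev = 'O'  # the start state behaves like Outside
    for tag in tags:
        if tag not in VALID_TAGS:
            return False
        if tag == 'I-DFNDUM' and prev not in ('B-DFNDUM', 'I-DFNDUM'):
            return False
        prev = tag
    return True

print(is_valid_iob(['O', 'B-DFNDUM', 'I-DFNDUM', 'O']))  # → True
print(is_valid_iob(['O', 'I-DFNDUM']))                   # → False
```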
      <p>After the training is done, the model tries to predict the IOB tags. In table 4 an example of a successful
identification of the definiendum is shown.</p>
      <p>[Figure 3: state diagram of the allowed IOB tag transitions between the start state, O, B, and I.]</p>
      <p>To obtain the tagged text, the whole body of text from Wikipedia was used. The examples of definitions were
obtained by filtering the articles with the following two properties:
• The article has a section whose heading contains the word definition.</p>
      <p>• The title of the article must appear at least once in this section.</p>
      <p>These sections were assumed to be definitions, and the title of the article to which they belong was assumed to
be the definiendum. Only 5,229 articles were found matching these criteria (February 2019) out of the more than
6 million articles in the English Wikipedia. The dataset was split into training and test data; the results are
shown in Figure 5. The results on the definitions found in the algebraic geometry (math.AG) articles uploaded to
the arXiv in 2015 are pictured in Figure 4.</p>
      <p>Several difficulties were observed with this approach. For instance, many of the articles from Wikipedia are
about topics completely unrelated to mathematics, and some artifacts remain in the text after stripping all the
wiki markup from it.</p>
      <p>IOB Accuracy: 91.1%
Precision: 31.5%
Recall: 67.6%
F-Measure: 43.0%</p>
    </sec>
    <sec id="sec-5">
      <title>Input to the Classifier</title>
      <p>Let n ≥ 1. Recall that the lexicographic order ≤_l on N^n is defined by v =
(v_1, …, v_n) ≤_l (w_1, …, w_n) = w if and only if either v = w or there is some i,
1 ≤ i ≤ n, with v_j = w_j for all j in the range 1 ≤ j &lt; i, and v_i &lt; w_i. Then
≤_l is an admissible order on N^n in the sense of \cite{BWK:98}. Indeed N^n,
together with componentwise addition and ≤_l, forms a totally ordered abelian
monoid. The lexicographic order ≤_l can be defined similarly on Z^n, forming a
totally ordered abelian group.
(Upper semicontinuity of valuation) Let f be a nonzero element of k[x_1, …, x_n]
and let a ∈ k^n. Then there exists a neighbourhood V ⊆ k^n of a such that for
all b ∈ V, v_b(f) ≤_l v_a(f).</p>
      <p>This claim concerns valuation-invariant lifting in relation to PL(A): it asserts
that the condition, ‘each element of PL(A) is valuation-invariant in S’, is
sufficient for an A-valuation-invariant stack in R^n to exist over S.
Let f/g be a nonzero element of K, let U ⊆ k^n be an open set throughout
which g ≠ 0, and let a ∈ U. Then there exists a neighbourhood V ⊆ U of a
such that for all b ∈ V, ord_b(f/g) ≤ ord_a(f/g).</p>
    </sec>
    <sec id="sec-6">
      <title>Result</title>
      <p>[Table 3: the classifier's results on the inputs above were True Positive, False Positive, True Negative, and False Positive.]</p>
    </sec>
    <sec id="sec-10">
      <title>Conclusion</title>
      <p>The main objective of this article is to showcase the feasibility of a system that can search for definitions and
important terms in large bodies of mathematical text. Such a system needs extremely good classification
performance to keep errors from propagating to the NER stage, while remaining fast enough to tackle large
corpora such as all the mathematical articles on the arXiv. Considering the performance of the state-of-the-art
methods for text classification and NER available today, and after observing the performance of the current
prototype, we believe that such a system is possible.</p>
      <p>The next step toward more sophisticated methods is to use word embeddings or language models; methods that
make use of these achieve better performance in both the classification and NER tasks.</p>
      <p>
        To further improve the performance of the NER subtask, we also plan to increase the amount of
training data. The technique we used for the Wikipedia data can be adapted to other websites that host
similar content, like The Stacks Project (https://stacks.math.columbia.edu/) and the Groupprops subwiki
(https://groupprops.subwiki.org). Additionally, applying domain adaptation methods might help to improve
performance in case the labeled data deviates significantly from the unlabeled data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Réjean Ducharme, Pascal Vincent, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Jauvin</surname>
          </string-name>
          .
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>3</volume>
          (Feb):
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tao</given-names>
            <surname>Chen</surname>
          </string-name>
          , Ruifeng Xu,
          <string-name>
            <given-names>Yulan</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Improving sentiment analysis via sentence type classification using bilstm-crf and cnn</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>72</volume>
          :
          <fpage>221</fpage>
          -
          <lpage>230</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Deyan</given-names>
            <surname>Ginev</surname>
          </string-name>
          .
          <source>arXMLiv:08.2018 dataset, an HTML5 conversion of arXiv.org</source>
          ,
          <year>2018</year>
          . SIGMathLing - Special Interest Group on Math Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Deyan</given-names>
            <surname>Ginev</surname>
          </string-name>
          .
          <article-title>A web demo for scientific paragraph classification</article-title>
          . https://github.com/dginev/web-scipara-demo,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jing</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>ChengXiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Instance weighting for domain adaptation in nlp</article-title>
          .
          <source>In Proceedings of the 45th annual meeting of the association of computational linguistics</source>
          , pages
          <fpage>264</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>