=Paper=
{{Paper
|id=Vol-1186/paper-23
|storemode=property
|title=Mathematical Language Processing Project
|pdfUrl=https://ceur-ws.org/Vol-1186/paper-23.pdf
|volume=Vol-1186
|dblpUrl=https://dblp.org/rec/conf/mkm/PagelS14
}}
==Mathematical Language Processing Project==
Mathematical Language Processing
Project
Robert Pagel and Moritz Schubotz
Database Systems and Information Management Group,
Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
rob@clabs.cc,schubotz@tu-berlin.de
http://demo.formulasearchengine.com/
Abstract. In natural language, words and phrases themselves imply the
semantics. In contrast, the meaning of identifiers in mathematical for-
mulae is undefined. Thus scientists must study the context to decode
the meaning. The Mathematical Language Processing (MLP) project
aims to support that process. In this paper, we compare two approaches
to discover identifier-definition tuples. At first we use a simple pattern
matching approach. Second, we present the MLP approach that uses
part-of-speech tag based distances as well as sentence positions to calcu-
late identifier-definition probabilities. The evaluation of our prototypical
system, applied on the Wikipedia text corpus, shows that our approach
augments the user experience substantially. While hovering the identi-
fiers in the formula, tool-tips with the most probable definitions occur.
Tests with random samples show that the displayed definitions provide
a good match with the actual meaning of the identifiers.
Key words: definition discovery, text mining, parallel computing
1 Introduction
Mathematical formulae are viable sources of
information for a wide range of scientists. Of-
ten, they contain identifiers whose meaning
might be at first unknown or at least ambigu-
ous to the reader (depending on their knowl-
edge). Therefore, one usually needs to study
the surrounding text to find the relevant def-
inition. An automatic information retrieval
system can be used to reduce the reader’s
effort by displaying the most relevant defini-
tion relation found to the reader. Students and Fig. 1. Screenshot of the
scientists of other disciplines would especially energy mass relation page
profit from a system that helps them to un- ‘Mass–energy equivalence’,
derstand formulae more quickly. In the long while hovering the letter ‘E’.
2 Pagel, Schubotz
term, the extracted identifier definition tuples contribute to an increased ma-
chine readability of scientific publications. This builds a foundation for added
value services such as search, clustering and improved accessibility.
To build such a system, a labelled text corpus that annotates identifiers and
their definition is desirable. At the project start, such a corpus was not available.
Consequently we had to start manual investigation of individual articles. Our
first observation was that many identifier definitions use a fixed string pattern
to explain the definition to the reader. Furthermore, most definitions usually
appear very close to the related identifier in the sentences. Thus, we calculate
the probabilities for correct identifier definition tuples based on distance metrics
for certain part-of-speech (POS) tagged words. This correlates to the experience
that readers usually extract identifier definitions from context that is given by
the surrounding text.
We chose the Wikipedia as the target text corpus because of two facts. First,
most articles make use of tags (texvc as an input language) for formulae.
The identification of tags is trivial, and from the MathML output,
it is easy to extract the identifiers. Second, the articles are already annotated
with mark-up. Particularly, hyperlinks to other articles within Wikipedia are of
interest as they typically wrap around any number of words and indicate that
these in combination are relevant in the given context or (respectively) sentence.
The English Wikipedia contains roughly four million articles. Even if we only
pick articles containing tags, our processor still needs to compute with
tens of thousands of articles. Especially when using text annotators (e.g., POS
tagger [8]), like Stanford’s NLP framework, one can make use of a parallel pro-
cessing system to speed up computation. We implement the proposed strategy
with the Stratosphere system [3]. It is based on the PACT programming model
[2], which enables us to rapidly generate a large amount of definition relation
candidates with only minimal implementation overhead for the parallelization.
Related Work. Quoc et al. [7] proposed an approach for relating whole formulas
to sentences and their describing paragraphs. Yokoi et al. [10] trained a sup-
port vector machine to extract natural language descriptions for mathematical
expressions. Furhter work in this field was done by [4] and [5].
2 Pattern-based Definition Discovery
At first, we implemented a simple identifier definition extractor that is based on
a set of static patterns. As this is a fairly robust approach and easy to implement,
it serves as a good reference point in terms of performance. It simply iterates
through the text, trying to find word groups, that are matched by a pattern. The
patterns being used to discover description terms are depicted in Table 1. Due
to the fact that we already tokenized and annotated the articles in a previous
step in the MLP system, we can make use of POS tags here as well.
Note, determiners not only contain articles, but also quantifiers and distribu-
tives. The last pattern in Table 1 contains ‘*/DT’. This is a shorthand for every
MLP 3
word, that has the POS tag ‘DT’ (determiner). Otherwise this pattern would be
rather large, as it needs to contain every possible determiner. IDENTIFIER as
well as DESCRIPTION are place-holder, that mark the positions of the entities
from a possible definition relation.
Pattern
is
is the
let be the
is|are denoted by
denotes */DT
Table 1. Implemented static patterns
3 Statistical Definition Discovery
We detect relations between identifiers and their description in two steps. First,
we extract the identifiers from the formulae found in an article, and second we
determine their description from the surrounding text.
Extracting relevant identifiers from the article relies on the assumption that
the author will use tags for all formulae. This said, a formula that
is written in the running text cannot be recognized, and therefore, cannot be
extracted by our system.
The fact that we can estimate all relevant identifiers for an article (see Section
4.1), combined with some common assumptions about definition relations, can
be exploited to largely reduce the set of candidates that need to be ranked.
Please note that this reduction is essential for retrieving the correct relations for
our approach. Otherwise almost any word would be ranked and the precision of
the retrieval would drop significantly.
The basic assumption of our approach is that the two entities of a definition
relation co-occur in the same sentence. In other words, if we want to retrieve
the description for an identifier, only sentences containing the identifier could
include the definition relation. Having said this, any other sentences can be
ignored. Furthermore, we assume that it is more likely to find the description
in first sentences than in the latter. This is based on the idea that authors
introduce the meaning of an identifier and than subsequent use the identifier,
without necessarily repeating its definition.
Another assumption can be made about the lexical class of the definition
relation we want to rank. The descriptions are nouns or even noun phrases
(e.g., ‘the effective spring constant k’ or ‘mass m of something’). We
discard all other words (according to their POS tag) except noun phrases and
Wikipedia hyper-links. These are the candidates descriptions for a definition
4 Pagel, Schubotz
relation. Noun phrases and hyper-links may consist of multiple words. For all
intents and purposes, it is not necessary to threat noun phrases and hyper-links
as a set of words, and therefore, they will be treated subsequently as if they were
one. This is important, due to the fact that the overall ranking will be greatly
influenced by the distance of candidates to the position of the identifier.
3.1 Numerical Statistics
Each description candidate is ranked with the weighted sum
αRσd (∆) + βRσs (n) + γ tf(t, s)
R(n, ∆, t, d) = 7→ [0, 1]. (1)
α+β+γ
The weighted sum depends on the distance ∆ (amount of word tokens) be-
tween identifier and the description term t, the sentence number n counting
(from the beginning of the article) all sentences containing the identifier, and
the term frequency tf(t,
h s) in the
i set of sentences s. The distance was normalized
2
with Rσ (∆) = exp − 21 ∆σ−12 . We assume that the probability to find a rela-
tion at ∆ = 1 is maximal. For example in the text fragment ‘the energy E,
and the mass m’, in order to determine the full width at half maximum of our
distribution, weqevaluated some articles manually and found Rσd (1) ≈ 2Rσd (5)
12
and thus σd = ln 2 . The probability to find a correct definition decays to 50%
−1
within three sentences. Consequently σs = 2 (ln 2) 2 .
Robustness. The classic tf-idf [9] statistic reflects the importance of a term to
a document. For our task, the inverse document frequency (idf) assigns high
penalties to frequent words like ‘length’, as opposed to words seldom seen such
as ‘Hamiltonian’. These are both valid definitions for identifiers. As the influence
of tf(t, s) on the sensitivity of the overall ranking (1) seems to be very high, we
reduce the impact with the tuning parameters γ = 0.1 and remain α = β = 1.
Please note that the algorithm currently only takes sentences into account which
were found in a single article. In the future, the MLP system will examine sets
of closely related articles. This will leverage the problem that distributional
properties will be volatile on term universes with very few members (e.g., term
frequencies in a single sentence).
Implementation. We implemented the MLP processing system [6] as Strato-
sphere data-flow using Java which allows for scalable execution, application of
complex higher order functions, and easy integration of third party tools such
as Stanford NLP and the Mylyn framework for mark-up parsing.
MLP 5
Map
Tagger
Map CoGroup Reduce
Wiki Dumps Raw Candidates
Parser Kernel Filter
Fig. 2. Data flow of the Stratosphere program
4 Evaluation
4.1 Identifier Retrieval
Throughout our experiments, we made some observations that had an impact on
the accuracy of retrieving the correct set of identifiers. First of all, people tend
to use texvc only as a typesetting language and neglect its semantic capabilities.
For example, \text{log} is more often used than the correct operator \log.
Another problem is that sometimes people use indices as a form of ‘in field’
annotation, like Tbef ore and Taf ter . The identifier T is defined in the surrounding
text, but neither Tbef ore nor Taf ter . There are more ambiguities. For example
the superscripted 2 in x2 and σ 2 can be interpreted as the power or as a part of
the identifier. Another ambiguity is that the multiplication sign can be omitted,
so that it is undecidable for a naive program whether ab2 contains one or two
identifiers.
We took a very conservative approach and preprocessed all formulas. The
TEX command \text{} blocks along with subscriptions containing more than a
single character will be removed before analysis. Superscripts will also be ignored
in terms of being a part of the identifier. Moreover, we created a comprehensive
blacklist to improve the results further. Identifier like ‘a’, ‘A’, and ‘I’, which
are also very common in the English language, could be easily matched by our
processor in the surrounding text, and therefore, will also be blacklisted. Addi-
tionally, we blacklist common mathematical operators, constants, and functions.
We took a sample of 30 random articles and counted all matches by hand. The
resulting estimates for the identifier retrieval performance are Recall: 0.99 and
Precision: 0.86, which satisfy our information needs, as we are mostly interested
in recall at this stage.
4.2 Description Retrieval
We ran our program on a dataset of about 20,000 articles, all containing
tags, and retrieved about 550,000 candidate relations. The most common defi-
nition relations are listed in table 2.
6 Pagel, Schubotz
Identifier Descriptions Count
n number 1709
t time 1480
M mass 1042
r radius 752
T temperature 666
θ angle 639
G group 635
Table 2. Most common definition relations
Observations. We observed some poorly ranked relations. For example, in the
fragment ‘where φ( ri ) is the electrostatic potential’, the distance is
∆(φ, electrostatic potential) = 6. This is due to counting brackets and function ar-
guments as words. Also wrongly tagged words like ‘Hamiltonian’ as an adjective
leads to false negatives.
Comparative Evaluation At the start of our project there were no gold stan-
dard datasets available to measure the performance of identifier definition ex-
tractors. Thus, we created one on our own. This is a very time consuming job.
At the moment, the dataset only contains two large articles (revision ids are
included) with around 100 identifier definitions. This dataset is also available on
the project repository.
As in many articles, those in the evaluation dataset contain identifiers whose
description cannot be retrieved. This is due to two reasons. First and foremost,
the identifier found in a formula is never mentioned in the surrounding text, and
therefore, no description can be extracted. Second, the identifier is somehow
ambiguous (see Section 4.1) and has been dropped. Most notably, identifiers
like Ixx will be discarded because of an ambiguous index that contains multiple
letters.
Unfortunately 32 out of 99 identifiers from our dataset fall into that category.
We’ve decided to evaluate the performance of the remainder, as those 32 do not
convey any conceptual flaws. From the users standpoint, the overall performance
(in terms of recall) of such a system would be rather annoying. As we are only
interested in evaluating the performance of the MLP Ranking algorithm itself,
it is safe to ignore those 32 identifiers.
MLP-Ranking (k = 1) MLP-Ranking (k = 2) Pattern Matching
Precision 0.872 0.915 0.911
Recall 0.839 0.892 0.733
Table 3. Evaluation results. Note: k equals the amount of the top ranked candidate
definitions.
MLP 7
Our results show that the unoptimized MLP approach keeps up with the
performance of the simple pattern matcher. Furthermore, we observed that it is
more robust in terms of recall, as it is less vulnerable to small changes in the
sentence structure.
5 Further work
Our original intuition was to discover grammatical patterns like ‘
indicates/stands for/denotes ’ based on the statistical find-
ings. However, our impression is that this would not lead to significant perfor-
mance gain.
The distance measure Rσd fails for the example of Fig. 1 since ∆(energy, E) =
∆(mass, m) = 2. Unfortunately, one cannot simply detect punctuation marks
and introduce some kind of directed associativity (e.g., inflicting a penalty on
the ranking if the candidate relations spans over a comma). This leads to whole
classes of relation ‘types’ (in terms of the grammatical structure) never being
retrieved. We plan to mitigate this problem by taking more closely related sci-
entific articles (based on their specific fields) into consideration and count the
frequencies of the candidate relations. The intuition behind this is, that articles
of the same scientific field will likely use the same definition for the identifiers.
Moreover, we also hope to resolve the problem of ‘dangling’ identifiers (those not
mentioned in the article itself), as they might be described in related articles.
Currently, we use the ranking R to identify the most probable description-
identifier tuple on each article, even if it occurs multiple times on the page.
For example, in the ‘Mass-energy equivalence’ article, 21 sentences contain the
combination of thePn identifier E and the noun ‘energy’. A promising approach,
is to use RΣ = i=1 2−i Ri , where Ri is a sorted list. Here, R1 is the highest
ranked definition for that relation according to the current measure R. A sys-
tematic approach for determining a wise choice of the ranking parameters should
significantly improve the overall performance of our system.
6 Conclusion
Our experiments showed that selecting candidates according to their POS tags
combined with numerical statistics about the text surface, can lead to qual-
ity results. However, this approach is only applicable under certain conditions.
For identifiers which are seldom seen, our statistical approach tends to fail. In
that situation, other methods, especially supervised ones, are preferred. Unfortu-
nately, many of them require a labelled test corpus to measure the performance
of a classifier that could be trained with our generated data. Currently, we are
planning to use the NTCIR-Math-10 Task, Math Understanding Subtask gold
standard dataset [1] for a comparable evaluation.
During this project we had the impression that one could discover ‘names-
paces’ (sets of documents, that use the same definitions for identifier) to aid
in the retrieval process. Robert Pagel is currently working on this topic for his
diploma thesis.
Acknowledgments. Thanks to Howard Cohl for proofreading the paper and to
Holmer Hemsen, the course instructor of the database project course at TU-
Berlin in Fall 2012. The implementation and a first draft of this paper was
completed in the duration of this course.
Bibliography
[1] Aizawa, A., Kohlhase, M., and Ounis, I. (2013). NTCIR-10 Math Pilot Task
Overview. In Proceedings of the 10th NTCIR Conference on Evaluation of
Information Access Technologies, pages 654–661, Tokyo, Japan.
[2] Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl,
V., Nijkamp, E., and Warneke, D. (2010). Massively parallel data analysis with
PACTs on Nephele. Proceedings of the VLDB Endowment, 3:1625–1628.
[3] Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise,
A., Kao, O., Leich, M., Leser, U., Markl, V., et al. (2014). The stratosphere
platform for big data analytics. The VLDB Journal, pages 1–26.
[4] Ganesalingam, M. (2013). The Language of Mathematics. Springer.
[5] Kamareddine, F. and Wells, J. B. (2008). Computerizing mathematical text
with mathlang. Electronic Notes in Theoretical Computer Science, 205:5–30.
[6] Pagel, R. (2013). Mlp project repository. https://github.com/rbzn/
project-mlp.
[7] Quoc, M. N., Yokoi, K., Matsubayashi, Y., and Aizawa, A. (2010).
Mining coreference relations between formulas and text using Wikipedia.
(August):69–74.
[8] Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tag-
ging.
[9] Salton, G. and McGill, M. J. (1986). Introduction to Modern Information
Retrieval. McGraw-Hill, Inc., New York, NY, USA.
[10] Yokoi, K., Nghiem, M.-q., Matsubayashi, Y., and Aizawa, A. (2010). Con-
textual Analysis of Mathematical Expressions for Advanced Mathematical
Search.