
Mathematical Language Processing Project

Robert Pagel

rob@clabs.cc

Moritz Schubotz

schubotz@tu-berlin.de

Database Systems and Information Management Group, Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany

In natural language, words and phrases themselves imply the semantics. In contrast, the meaning of identifiers in mathematical formulae is undefined. Thus scientists must study the context to decode the meaning. The Mathematical Language Processing (MLP) project aims to support that process. In this paper, we compare two approaches to discover identifier-definition tuples. First, we use a simple pattern matching approach. Second, we present the MLP approach that uses part-of-speech tag based distances as well as sentence positions to calculate identifier-definition probabilities. The evaluation of our prototypical system, applied to the Wikipedia text corpus, shows that our approach augments the user experience substantially. While hovering over the identifiers in a formula, tool-tips with the most probable definitions appear. Tests with random samples show that the displayed definitions provide a good match with the actual meaning of the identifiers.

Keywords: definition discovery, text mining, parallel computing

Mathematical formulae are viable sources of information for a wide range of scientists. Often, they contain identifiers whose meaning might be at first unknown or at least ambiguous to the reader (depending on their knowledge). Therefore, one usually needs to study the surrounding text to find the relevant definition. An automatic information retrieval system can be used to reduce the reader's effort by displaying the most relevant definition relation found to the reader. Students and scientists of other disciplines would especially profit from a system that helps them to understand formulae more quickly. In the long term, the extracted identifier-definition tuples contribute to an increased machine readability of scientific publications. This builds a foundation for added value services such as search, clustering and improved accessibility.

Fig. 1. Screenshot of the energy mass relation page `Mass–energy equivalence', while hovering the letter `E'.

To build such a system, a labelled text corpus that annotates identifiers and their definitions is desirable. At the project start, such a corpus was not available. Consequently, we had to start with a manual investigation of individual articles. Our first observation was that many identifier definitions use a fixed string pattern to explain the definition to the reader. Furthermore, most definitions appear very close to the related identifier in the sentence. Thus, we calculate the probabilities for correct identifier-definition tuples based on distance metrics for certain part-of-speech (POS) tagged words. This corresponds to the experience that readers usually extract identifier definitions from the context given by the surrounding text.

We chose Wikipedia as the target text corpus for two reasons. First, most articles make use of <math/> tags (with texvc as the input language) for formulae. The identification of <math/> tags is trivial, and from the MathML output it is easy to extract the identifiers. Second, the articles are already annotated with mark-up. In particular, hyperlinks to other articles within Wikipedia are of interest, as they typically wrap around any number of words and indicate that these words in combination are relevant in the given context or sentence.

The English Wikipedia contains roughly four million articles. Even if we only pick articles containing <math/> tags, our processor still needs to cope with tens of thousands of articles. Especially when using text annotators (e.g., a POS tagger [8]) such as those of Stanford's NLP framework, one can make use of a parallel processing system to speed up computation. We implement the proposed strategy with the Stratosphere system [3]. It is based on the PACT programming model [2], which enables us to rapidly generate a large number of definition relation candidates with only minimal implementation overhead for the parallelization.

Related Work. Quoc et al. [7] proposed an approach for relating whole formulas to sentences and their describing paragraphs. Yokoi et al. [10] trained a support vector machine to extract natural language descriptions for mathematical expressions. Further work in this field was done by [4] and [5].

2 Pattern-based Definition Discovery

First, we implemented a simple identifier definition extractor that is based on a set of static patterns. As this is a fairly robust approach and easy to implement, it serves as a good reference point in terms of performance. It simply iterates through the text, trying to find word groups that are matched by a pattern. The patterns used to discover description terms are listed in Table 1. Because we already tokenized and annotated the articles in a previous step of the MLP system, we can make use of POS tags here as well.

Note that determiners comprise not only articles but also quantifiers and distributives. The last pattern in Table 1 contains `*/DT'. This is a shorthand for every word that has the POS tag `DT' (determiner). Otherwise this pattern would be rather large, as it would need to contain every possible determiner. IDENTIFIER and DESCRIPTION are placeholders that mark the positions of the entities of a possible definition relation.
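The static patterns of Table 1 can be illustrated with a short sketch. It is not the project's implementation: it assumes that a sentence is already available as a list of (word, POS tag) pairs, that a description is a single noun token, and that the set of identifiers is known. The names `PATTERNS`, `match_at` and `extract` are hypothetical.

```python
# Table 1 patterns. "ID" and "DESC" are the placeholders, "*/DT" matches any
# determiner, and a set of words matches any one of its members.
PATTERNS = [
    ["DESC", "ID"],
    ["ID", {"is"}, "DESC"],
    ["ID", {"is"}, {"the"}, "DESC"],
    [{"let"}, "ID", {"be"}, {"the"}, "DESC"],
    ["DESC", {"is", "are"}, {"denoted"}, {"by"}, "ID"],
    ["ID", {"denotes"}, "*/DT", "DESC"],
]

# Assumption: Penn Treebank noun tags mark description candidates.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def match_at(tokens, start, pattern, identifiers):
    """Return (identifier, description) if the pattern matches at start."""
    if start + len(pattern) > len(tokens):
        return None
    ident = desc = None
    for (word, tag), element in zip(tokens[start:], pattern):
        if element == "ID":
            if word not in identifiers:
                return None
            ident = word
        elif element == "DESC":
            if tag not in NOUN_TAGS or word in identifiers:
                return None
            desc = word
        elif element == "*/DT":
            if tag != "DT":
                return None
        elif word.lower() not in element:
            return None
    return ident, desc


def extract(tokens, identifiers):
    """Try every pattern at every position and collect distinct matches."""
    found = []
    for start in range(len(tokens)):
        for pattern in PATTERNS:
            hit = match_at(tokens, start, pattern, identifiers)
            if hit and hit not in found:
                found.append(hit)
    return found
```

For the tagged fragment `Let E be the energy`, `extract` would return the single tuple `("E", "energy")` via the fourth pattern. Because the patterns are fixed word sequences, a small change in sentence structure is enough to miss a definition.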

Table 1. Patterns used to discover description terms.

Pattern
<description> <identifier>
<identifier> is <description>
<identifier> is the <description>
let <identifier> be the <description>
<description> is|are denoted by <identifier>
<identifier> denotes */DT <description>

We detect relations between identifiers and their descriptions in two steps. First, we extract the identifiers from the formulae found in an article, and second, we determine their descriptions from the surrounding text.

Extracting relevant identifiers from the article relies on the assumption that the author uses <math/> tags for all formulae. Consequently, a formula that is written in the running text cannot be recognized and therefore cannot be extracted by our system.

The fact that we can estimate all relevant identifiers for an article (see Section 4.1), combined with some common assumptions about definition relations, can be exploited to largely reduce the set of candidates that need to be ranked. Please note that this reduction is essential for retrieving the correct relations with our approach. Otherwise almost any word would be ranked and the precision of the retrieval would drop significantly.

The basic assumption of our approach is that the two entities of a definition relation co-occur in the same sentence. In other words, if we want to retrieve the description for an identifier, only sentences containing the identifier can include the definition relation. All other sentences can therefore be ignored. Furthermore, we assume that it is more likely to find the description in the first sentences than in later ones. This is based on the idea that authors introduce the meaning of an identifier and then subsequently use the identifier without necessarily repeating its definition.

Another assumption can be made about the lexical class of the definition relation we want to rank. The descriptions are nouns or even noun phrases (e.g., `the effective spring constant k' or `mass m of something'). We discard all other words (according to their POS tag) except noun phrases and Wikipedia hyperlinks. These are the candidate descriptions for a definition relation. Noun phrases and hyperlinks may consist of multiple words. For our purposes, it is not necessary to treat noun phrases and hyperlinks as sets of words; therefore, each of them is subsequently treated as a single token. This is important because the overall ranking is greatly influenced by the distance of a candidate to the position of the identifier.
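The candidate reduction described above can be sketched as follows. This is an illustration under stated assumptions, not the MLP code: sentences are lists of (text, tag) pairs in which every noun phrase or hyperlink has already been merged into one token, and the tags `NP` and `LINK` as well as the function name `candidates` are hypothetical.

```python
# Assumption: merged noun phrases carry the tag "NP", hyperlinks the tag "LINK".
CANDIDATE_TAGS = {"NP", "LINK"}


def candidates(sentences, identifier):
    """Yield (description, distance, sentence_number) for one identifier.

    Only sentences containing the identifier are considered; they are
    numbered 1, 2, ... in order of appearance. The distance is the number
    of token positions between the candidate and the nearest occurrence
    of the identifier in the same sentence.
    """
    n = 0
    for sentence in sentences:
        positions = [i for i, (text, _) in enumerate(sentence) if text == identifier]
        if not positions:
            continue  # sentences without the identifier are ignored
        n += 1
        for j, (text, tag) in enumerate(sentence):
            if tag in CANDIDATE_TAGS:
                delta = min(abs(j - i) for i in positions)
                yield text, delta, n
```

Treating a multi-word noun phrase as one token keeps its distance to the identifier independent of the phrase length, which matters because the distance feeds directly into the ranking.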

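3.1 Numerical Statistics

Each description candidate is ranked with the weighted sum

$$R(n, \Delta, t, d) = \frac{\alpha R_{\sigma_d}(\Delta) + \beta R_{\sigma_s}(n) + \gamma\, \mathrm{tf}(t, s)}{\alpha + \beta + \gamma} \mapsto [0, 1]. \qquad (1)$$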

The weighted sum depends on the distance $\Delta$ (number of word tokens) between the identifier and the description term $t$, the sentence number $n$, which counts (from the beginning of the article) all sentences containing the identifier, and the term frequency $\mathrm{tf}(t, s)$ in the set of sentences $s$. The distance was normalized with

$$R_\sigma(\Delta) = \exp\left[-\frac{1}{2}\,\frac{\Delta^2 - 1}{\sigma^2}\right].$$

We assume that the probability to find a relation is maximal at $\Delta = 1$, for example in the text fragment `the energy E, and the mass m'. In order to determine the full width at half maximum of our distribution, we evaluated some articles manually and found $R_{\sigma_d}(1) \approx 2 R_{\sigma_d}(5)$ and thus $\sigma_d = \sqrt{12 / \ln 2}$. The probability to find a correct definition decays to 50% within three sentences. Consequently, $\sigma_s = 2 (\ln 2)^{-1/2}$.
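The two widths can be checked with a short worked derivation. It assumes that the normalization has the form $R_\sigma(\Delta) = \exp[-(\Delta^2 - 1)/(2\sigma^2)]$ and that `decays to 50% within three sentences' means $R_{\sigma_s}(3) = 1/2$; the decimal values are rounded.

```latex
\begin{align*}
R_\sigma(\Delta) &= \exp\!\left[-\frac{1}{2}\,\frac{\Delta^2 - 1}{\sigma^2}\right],
  \qquad R_\sigma(1) = 1,\\
R_{\sigma_d}(5) = \tfrac{1}{2}
  &\;\Rightarrow\; \frac{5^2 - 1}{2\sigma_d^2} = \ln 2
  \;\Rightarrow\; \sigma_d = \sqrt{\frac{12}{\ln 2}} \approx 4.16,\\
R_{\sigma_s}(3) = \tfrac{1}{2}
  &\;\Rightarrow\; \frac{3^2 - 1}{2\sigma_s^2} = \ln 2
  \;\Rightarrow\; \sigma_s = \frac{2}{\sqrt{\ln 2}} \approx 2.40.
\end{align*}
```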

Robustness. The classic tf-idf [9] statistic reflects the importance of a term to a document. For our task, the inverse document frequency (idf) would assign high penalties to frequent words like `length', as opposed to seldom seen words such as `Hamiltonian'. Both are valid definitions for identifiers. As the influence of $\mathrm{tf}(t, s)$ on the sensitivity of the overall ranking (1) seems to be very high, we reduce its impact with the tuning parameter $\gamma = 0.1$ and keep $\alpha = \beta = 1$. Please note that the algorithm currently only takes into account sentences that were found in a single article. In the future, the MLP system will examine sets of closely related articles. This will alleviate the problem that distributional properties are volatile on term universes with very few members (e.g., term frequencies in a single sentence).
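The ranking can be sketched in a few lines. The sketch assumes that the weighted sum (1) is the combination $(\alpha R_{\sigma_d}(\Delta) + \beta R_{\sigma_s}(n) + \gamma\,\mathrm{tf}) / (\alpha + \beta + \gamma)$ with a Gaussian-shaped $R_\sigma$ that equals 1 at an argument of 1, and that the term frequency has been scaled to [0, 1]. The function names `gauss` and `rank` are hypothetical.

```python
import math

SIGMA_D = math.sqrt(12 / math.log(2))  # from R(1) = 2 R(5) for the token distance
SIGMA_S = 2 / math.sqrt(math.log(2))   # 50% after three sentences
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.1     # gamma damps the term frequency


def gauss(x, sigma):
    """R_sigma(x) = exp(-(x^2 - 1) / (2 sigma^2)); equals 1 at x = 1."""
    return math.exp(-0.5 * (x * x - 1) / (sigma * sigma))


def rank(n, delta, tf):
    """Score a candidate from sentence number n >= 1, distance delta >= 1
    and a term frequency tf that is assumed to lie in [0, 1]."""
    total = ALPHA * gauss(delta, SIGMA_D) + BETA * gauss(n, SIGMA_S) + GAMMA * tf
    return total / (ALPHA + BETA + GAMMA)
```

A candidate directly next to the identifier in the first matching sentence reaches the maximum score only if its term frequency is also maximal; with $\gamma = 0.1$ the term frequency can shift the score by at most $0.1 / 2.1$.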

Implementation. We implemented the MLP processing system [6] as a Stratosphere data-flow in Java, which allows for scalable execution, the application of complex higher-order functions, and easy integration of third-party tools such as Stanford NLP and the Mylyn framework for mark-up parsing.

[Figure: data-flow of the MLP system. Node labels: Wiki Dumps, Map, Parser, Map, Tagger, CoGroup, Kernel, Reduce, Filter, Raw Candidates.]

Throughout our experiments, we made some observations that had an impact on the accuracy of retrieving the correct set of identifiers. First of all, people tend to use texvc only as a typesetting language and neglect its semantic capabilities. For example, \text{log} is used more often than the correct operator \log. Another problem is that people sometimes use indices as a form of `infield' annotation, as in $T_{\text{before}}$ and $T_{\text{after}}$. The identifier $T$ is defined in the surrounding text, but neither $T_{\text{before}}$ nor $T_{\text{after}}$ is. There are more ambiguities. For example, the superscripted 2 in a term such as $x^2$ can be interpreted as a power or as a part of the identifier. Another ambiguity is that the multiplication sign can be omitted, so that it is undecidable for a naive program whether $ab^2$ contains one or two identifiers.

We took a very conservative approach and preprocessed all formulas. \text{} blocks and subscripts containing more than a single character are removed before analysis. Superscripts are never treated as a part of the identifier. Moreover, we created a comprehensive blacklist to improve the results further. Identifiers like `a', `A', and `I', which are also very common words in the English language, could easily be matched by our processor in the surrounding text and are therefore blacklisted. Additionally, we blacklist common mathematical operators, constants, and functions.
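The preprocessing rules can be illustrated with a small sketch. It is a simplification under loud assumptions: it works on the TeX source with regular expressions, whereas the system described here reads the identifiers from the MathML output, and the blacklist below is a tiny hypothetical sample, not the project's list. The name `extract_identifiers` is hypothetical as well.

```python
import re

# Hypothetical sample; the actual blacklist is described as comprehensive.
BLACKLIST = {"a", "A", "I", "\\log", "\\sin", "\\cos", "\\pi", "\\frac", "\\sqrt"}


def extract_identifiers(tex):
    """Return the set of identifier candidates of one TeX formula."""
    # Remove \text{...} blocks.
    tex = re.sub(r"\\text\{[^{}]*\}", " ", tex)
    # Remove subscripts with more than one character, e.g. T_{before} -> T.
    tex = re.sub(r"_\{[^{}]{2,}\}", " ", tex)
    # Ignore superscripts: ^2, ^{...} or ^\command.
    tex = re.sub(r"\^(\{[^{}]*\}|\\[A-Za-z]+|.)", " ", tex)
    # A candidate is a TeX command or a letter with an optional
    # single-character subscript.
    tokens = re.findall(
        r"\\[A-Za-z]+|[A-Za-z](?:_(?:\{[A-Za-z0-9]\}|[A-Za-z0-9]))?", tex
    )
    return {token for token in tokens if token not in BLACKLIST}
```

Applied to `ab^2`, this sketch drops the superscript, splits the remainder into `a` and `b`, and then discards the blacklisted `a`, which shows how conservative the rules are.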

We took a sample of 30 random articles and counted all matches by hand. The resulting estimates for the identifier retrieval performance are a recall of 0.99 and a precision of 0.86, which satisfy our information needs, as we are mostly interested in recall at this stage. We ran our program on a dataset of about 20,000 articles, all containing <math/> tags, and retrieved about 550,000 candidate relations. The most common definition relations are listed in Table 2.

Table 2. The most common definition relations.

Identifier   Description
n            number
t            time
M            mass
r            radius
T            temperature
             angle
G            group

Observations. We observed some poorly ranked relations. For example, in the fragment `where $\phi(r_i)$ is the electrostatic potential', the distance is $\Delta(\phi, \text{electrostatic potential}) = 6$. This is due to counting brackets and function arguments as words. In addition, wrongly tagged words, such as `Hamiltonian' tagged as an adjective, lead to false negatives.

Comparative Evaluation

At the start of our project, there were no gold standard datasets available to measure the performance of identifier definition extractors. Thus, we created one on our own. This is a very time-consuming job. At the moment, the dataset only contains two large articles (revision ids are included) with around 100 identifier definitions. This dataset is also available in the project repository.

As in many articles, those in the evaluation dataset contain identifiers whose description cannot be retrieved. There are two reasons for this. First and foremost, the identifier found in a formula is never mentioned in the surrounding text, and therefore no description can be extracted. Second, the identifier is ambiguous (see Section 4.1) and has been dropped. Most notably, identifiers like $I_{xx}$ are discarded because of an ambiguous index that contains multiple letters.

Unfortunately, 32 out of 99 identifiers from our dataset fall into that category. We decided to evaluate the performance on the remainder, as those 32 do not reveal any conceptual flaws of the ranking. From the user's standpoint, the overall recall of such a system would be rather disappointing. As we are only interested in evaluating the performance of the MLP ranking algorithm itself, it is safe to ignore those 32 identifiers.

            MLP-Ranking (k = 1)   MLP-Ranking (k = 2)   Pattern Matching
Precision   0.872                 0.915                 0.911
Recall      0.839                 0.892                 0.733
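For reference, precision and recall figures of this kind can be computed from two sets of identifier-description tuples: the tuples a system retrieved and the tuples of the gold standard. The sketch below uses the standard definitions; the function name `precision_recall` is hypothetical, and any tuples passed to it in an example are toy data, not the evaluation dataset.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of (identifier, description)."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

With three retrieved tuples of which two are correct, and four tuples in the gold standard, this yields a precision of 2/3 and a recall of 1/2.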

Our results show that the unoptimized MLP approach keeps up with the performance of the simple pattern matcher. Furthermore, we observed that it is more robust in terms of recall, as it is less vulnerable to small changes in the sentence structure.

5 Further Work

Our original intuition was to discover grammatical patterns like `<identifier> indicates/stands for/denotes <description>' based on the statistical findings. However, our impression is that this would not lead to a significant performance gain.

The distance measure $R_{\sigma_d}$ fails for the example of Fig. 1 since $\Delta(\text{energy}, E) = \Delta(\text{mass}, m) = 2$. Unfortunately, one cannot simply detect punctuation marks and introduce some kind of directed associativity (e.g., inflicting a penalty on the ranking if a candidate relation spans a comma). This would lead to whole classes of relation `types' (in terms of the grammatical structure) never being retrieved. We plan to mitigate this problem by taking more closely related scientific articles (based on their specific fields) into consideration and counting the frequencies of the candidate relations. The intuition behind this is that articles of the same scientific field will likely use the same definitions for the identifiers. Moreover, we also hope to resolve the problem of `dangling' identifiers (those not mentioned in the article itself), as they might be described in related articles.

Currently, we use the ranking $R$ to identify the most probable description-identifier tuple in each article, even if the tuple occurs multiple times on the page. For example, in the `Mass–energy equivalence' article, 21 sentences contain the combination of the identifier $E$ and the noun `energy'. A promising approach is to use $R_\Sigma = \sum_{i=1}^{n} 2^{-i} R_i$, where the $R_i$ form a sorted list. Here, $R_1$ is the highest-ranked definition for that relation according to the current measure $R$. A systematic approach for choosing the ranking parameters should significantly improve the overall performance of our system.
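The proposed aggregation over repeated occurrences can be sketched as follows, reading `sorted list' as sorted in descending order so that the highest score receives the weight 1/2. The function name `aggregate` is hypothetical.

```python
def aggregate(scores):
    """Combine the scores R_i of one relation as sum_i 2^(-i) * R_i,
    with the scores sorted in descending order and i starting at 1."""
    ordered = sorted(scores, reverse=True)
    return sum(score / 2 ** i for i, score in enumerate(ordered, start=1))
```

Since the weights $2^{-i}$ sum to less than 1, the aggregate stays below 1 as long as every individual score is at most 1, and each additional occurrence contributes at most half as much as the previous one.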

6 Conclusion

Our experiments showed that selecting candidates according to their POS tags, combined with numerical statistics about the text surface, can lead to results of good quality. However, this approach is only applicable under certain conditions. For identifiers that are seldom seen, our statistical approach tends to fail. In that situation, other methods, especially supervised ones, are preferred. Unfortunately, many of them require a labelled test corpus to measure the performance of a classifier that could be trained with our generated data. Currently, we are planning to use the gold standard dataset of the NTCIR-10 Math Task, Math Understanding Subtask [1], for a comparable evaluation.

During this project, we had the impression that one could discover `namespaces' (sets of documents that use the same definitions for identifiers) to aid in the retrieval process. Robert Pagel is currently working on this topic for his diploma thesis.

Acknowledgments. Thanks to Howard Cohl for proofreading the paper and to Holmer Hemsen, the instructor of the database project course at TU Berlin in Fall 2012. The implementation and a first draft of this paper were completed during this course.

Bibliography

[1] Aizawa, A., Kohlhase, M., and Ounis, I. (2013). NTCIR-10 Math Pilot Task Overview. In Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies, pages 654–661, Tokyo, Japan.

[2] Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., and Warneke, D. (2010). Massively parallel data analysis with PACTs on Nephele. Proceedings of the VLDB Endowment, 3:1625–1628.

[3] Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., et al. (2014). The Stratosphere platform for big data analytics. The VLDB Journal, pages 1–26.

[4] Ganesalingam, M. (2013). The Language of Mathematics. Springer.

[5] Kamareddine, F. and Wells, J. B. (2008). Computerizing mathematical text with MathLang. Electronic Notes in Theoretical Computer Science, 205:5–30.

[6] Pagel, R. (2013). MLP project repository. https://github.com/rbzn/project-mlp.

[7] Quoc, M. N., Yokoi, K., Matsubayashi, Y., and Aizawa, A. (2010). Mining coreference relations between formulas and text using Wikipedia. (August):69–74.

[8] Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging.

[9] Salton, G. and McGill, M. J. (1986). Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

[10] Yokoi, K., Nghiem, M.-q., Matsubayashi, Y., and Aizawa, A. (2010). Contextual Analysis of Mathematical Expressions for Advanced Mathematical Search.