
Augmenting Mathematical Formulae for More Effective Querying & Presentation

2016

Moritz Schubotz

1 Summary

Scientists and engineers regularly search for well-established mathematical concepts that are expressed by mathematical formulae. Conventional search engines today focus on keyword-based text search. An analogous approach does not work for mathematical formulae: knowledge about identifiers alone is not sufficient to derive the semantics of the formula they occur in. Currently, the solution for formula-related inquiries is to consult domain experts, which is slow, expensive and non-deterministic.

The first challenge, content augmentation, is to collect the full semantic information about individual formulae from a given input. Most fundamentally, this might start with the digitization of analogue mathematical content. It covers the conversion from imperative typesetting instructions (e.g. TeX) to declarative layout descriptions (e.g. presentation MathML), but also deals with inferring the syntactical structure of a formula (i.e. the expression tree, often represented in content MathML). In addition, this first challenge involves the association of formula metadata, such as constraints, identifier definitions, related keywords or substitutions, with individual formulae.

The second challenge is content querying. This ranges from query formulation, through query processing, actual search and hit ranking, to result presentation. There are different forms of formula queries. The first form is the standard ad-hoc retrieval query, where a user defines the information need and the math information retrieval (MIR) system returns a ranked list for a particular data set. Similar to this is the interactive formula filter query, where a user filters a data set interactively until she arrives at the result set that is relevant to her needs. Different from both are unattended queries that run in the background to assist authors during editing, or readers to identify related work while viewing a certain formula.

The third challenge is content indexing for growing data sets. This challenge includes the scalable execution of the solutions to the two aforementioned challenges. While well-established methods from the area of database systems, e.g. XML processing and indexing, can be applied, math-specific complexity problems require individual solutions.

Math search engines have used some form of similarity measure since the early days of math search in the 2000s. However, in July 2014 Youssef and Zhang [1] were the first to name five factors that contribute to similarity measures. Those factors are the starting point for my systematic approach to formula similarity measures, which extends their work in the following way:

First, I differentiate between proper and contextual formula similarity factors. Proper factors are quantified by applying a distance measure to a […]
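To make the idea of a distance-based similarity factor more concrete, the following minimal sketch computes a Jaccard distance over the sub-expressions of two expression trees. The nested-tuple encoding, the function names and the choice of the Jaccard measure are assumptions made for illustration only; they are not the factors defined in [1] or the measures used in this work.

```python
from typing import Union

# Hypothetical expression-tree encoding (an assumption for this sketch):
# a leaf is a string, an inner node is a tuple (operator, child, child, ...).
Expr = Union[str, tuple]


def subexpressions(expr: Expr) -> set:
    """Collect the expression itself and all of its sub-expressions."""
    found = {expr}
    if isinstance(expr, tuple):
        for child in expr[1:]:  # expr[0] is the operator name
            found |= subexpressions(child)
    return found


def jaccard_distance(a: Expr, b: Expr) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over the two sub-expression sets."""
    sa, sb = subexpressions(a), subexpressions(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


# E = m * c^2 versus E = m * v^2
lhs = ("eq", "E", ("times", "m", ("power", "c", "2")))
rhs = ("eq", "E", ("times", "m", ("power", "v", "2")))

print(jaccard_distance(lhs, lhs))  # identical trees have distance 0.0
print(jaccard_distance(lhs, rhs))  # trees sharing only some leaves
```

A set-based measure like this ignores how often a sub-expression occurs and where it sits in the tree; a tree edit distance would be a stricter alternative.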

Augmented content (challenge 1) opens up additional options for similarity search, and potentially improves the search results regardless of the applied similarity measure. In order to separate the effect of content augmentation from intrinsic improvements in the applied similarity features, I develop measures for formula data quality that separate those aspects. Afterwards, I compare similarity measures at a given level of data quality, based on that quality measure. The quality measure itself is a valuable contribution, and its use is not limited to search. For example, a quality measure can assist authors in checking their documents for (1) missing definitions, (2) ambiguities, (3) dependency problems, and (4) redundancy. Note that I focus on quality measures for individual formulae, assuming that the relevant meta-information has already been extracted from the surrounding text. Since the developed data quality measures are tailored to my approach to similarity search, I give more details on them after having introduced the similarity factors below.

In addition, I will describe existing MIR systems using the notation of the discussed similarity factors. This will provide a template for using formula similarity factors as building blocks for specialized applications or future MIR systems. It may turn out that not all features can be described by the identified factors. In that case I would refine the factor list to ensure that all systems that participated in the NTCIR 10 and 11 challenges can be modelled using the factors as building blocks.

[…] relevance judgement for document sections on the one hand, and on known item retrieval on the other hand. For the impact analysis of individual similarity factors, I used (1) gold standard driven sensitivity analysis [2] and (2) known item based evaluation [4].
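As a rough illustration of known item based evaluation, the sketch below scores two hypothetical system variants by mean reciprocal rank (MRR), a common metric for known-item retrieval. Whether [4] uses MRR is not stated here; the metric, the function names and the toy result lists are assumptions for illustration.

```python
from typing import Dict, List


def reciprocal_rank(ranked_ids: List[str], known_item: str) -> float:
    """Return 1/rank of the known item, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == known_item:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]],
                         targets: Dict[str, str]) -> float:
    """Average the reciprocal rank over all queries that have a target."""
    scores = [reciprocal_rank(runs.get(query, []), item)
              for query, item in targets.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical known items and result lists for two queries, once with and
# once without a particular similarity factor enabled.
targets = {"q1": "f17", "q2": "f03"}
with_factor = {"q1": ["f17", "f02"], "q2": ["f09", "f03"]}
without_factor = {"q1": ["f02", "f17"], "q2": ["f09", "f08"]}

print(mean_reciprocal_rank(with_factor, targets))
print(mean_reciprocal_rank(without_factor, targets))
```

Comparing the two scores for the same query set gives a simple per-factor impact estimate, in the spirit of switching one factor on and off while keeping everything else fixed.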
The two metrics discussed above measure not only the performance of MIR systems, but also the performance of individual similarity factors. The similarity measures evaluated were taken from existing MIR systems, from additional measures proposed in the literature, and from other disciplines. One result of this evaluation is that some factors, namely those in the group of proper semantic similarity measures, require a minimum level of data quality to contribute to search result quality in a meaningful way. By collecting a large and heterogeneous sample of similarity measures, I am confident that I have laid a good foundation for evaluating measures that will be developed in the future.

I used the existing arXiv corpus for the evaluation. In addition, based on the experience gained with that corpus, I created an additional corpus from Wikipedia. Since the HTML in these corpora was generated automatically from LaTeX or WikiText, respectively, data quality is not perfect, but I consider it good enough to gain qualitative insights into the impact on individual measures. In addition, I use the DLMF/DRMF data sets, which are partially available at different levels of data quality, to analyze the impact of data quality on formula search and on the effect of individual factors. I expect to see that the original version of LaTeXML without any content enrichment has the lowest data quality and the lowest precision in search results. The DRMF content augmentation process is expected to raise the data quality and also improve search results. The best results with regard to data quality and search results are expected from the manually generated dataset of DLMF chapters 1-4 by Zhang and Youssef.

Second, I analyze the impact of each individual factor and the inference patterns between different factors. Additionally, I create case studies and templates on how different factors can be combined into use-case specific similarity measures.

The identified factors imply dimensions of the aforementioned formula quality measure. Namely, I define four dimensions of formula quality: (1) typographic and lexical quality; (2) syntactic structure quality; (3) semantic quality; and (4) metadata quality. An example of low syntactic data quality is the misinterpretation of […] as […] rather than "applied at" […]. In a search context this might prevent relevant results from being matched. The associated quality measure for structural data quality needs to measure the degree to which the structure of the formula was captured correctly. Note that the quality measure for contextual information needs to be related to a main unit of possible relevant meta-information.
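The four quality dimensions can be pictured as a small record per formula. The sketch below is only an illustration under stated assumptions: the scores are taken to lie in [0, 1], the class and function names are hypothetical, and definition coverage is used as a simple stand-in for metadata quality rather than the measure developed in this work.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class FormulaQuality:
    """One score per quality dimension, each assumed to lie in [0, 1]."""
    typographic_lexical: float
    syntactic_structure: float
    semantic: float
    metadata: float

    def weakest(self) -> float:
        """Return the lowest of the four dimension scores."""
        return min(self.typographic_lexical, self.syntactic_structure,
                   self.semantic, self.metadata)


def definition_coverage(identifiers: Set[str],
                        definitions: Dict[str, str]) -> float:
    """Fraction of identifiers that have a non-empty definition attached.

    Hypothetical proxy for metadata quality; a formula without identifiers
    is treated as fully covered.
    """
    if not identifiers:
        return 1.0
    defined = sum(1 for name in identifiers if definitions.get(name))
    return defined / len(identifiers)


# E = m c^2 where only two of the three identifiers are defined in the text.
coverage = definition_coverage({"E", "m", "c"},
                               {"E": "energy", "m": "mass"})
quality = FormulaQuality(typographic_lexical=1.0, syntactic_structure=0.9,
                         semantic=0.8, metadata=coverage)
print(round(quality.metadata, 3), round(quality.weakest(), 3))
```

Reporting the weakest dimension is one possible way to flag formulae whose data quality may be too low for semantic similarity factors to contribute meaningfully; a weighted combination would be an equally plausible design.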