
Hagenberg, Austria

Formula Concept Discovery and Recognition

Philipp Scharpf
Dept. of Computer and Information Science, University of Konstanz, Germany
philipp.scharpf@uni-konstanz.de

2018

1 3 08

In my dissertation, I will develop a method to discover (define) and recognize (identify) formula concepts in Wikipedia articles and STEM documents, using Wikidata as a semantic knowledge-base. Both structural (syntax tree) and semantic (identifier names) formula information will be considered. The approach is expected to improve search engines, recommender systems, plagiarism and novelty detection, and ontology learning.

Research Motivation

My research is motivated by 1) the need for Information Retrieval systems to match mathematical formulae when assessing the semantic content and similarity of STEM documents, and 2) the challenge that a given mathematical formula concept usually appears in several variations or equivalent representations.

retrieved from the arXiv repository of electronic preprints (http://arxiv.org/) and Wikipedia. I am striving to develop a method that will be able to map, e.g., all of the formulae collected in Figure 1 to a single formula concept, in particular one that is linkable to the Wikidata entry https://www.wikidata.org/wiki/Q868967. I chose the semantic knowledge-base Wikidata because it is free, open, and can be read and edited by both humans and machines.

Research Method

Formula Feature Analysis

The first step in the formula feature analysis is tokenization, i.e., the decomposition of formulae into their components (identifiers, operators, numbers, etc.), and Part-Of-Math tagging: a formula consists of different terms, which must be distinguished from each other. The Klein-Gordon equation

$$\frac{1}{c^2}\frac{\partial^2}{\partial t^2}\psi - \nabla^2\psi + \frac{m^2c^2}{\hbar^2}\psi = 0$$

from quantum physics, as an instructive example, contains a term $\frac{1}{c^2}\frac{\partial^2}{\partial t^2}$ with a double time derivative, one with a double space derivative $\nabla^2$, and one with a constant prefactor $\frac{m^2c^2}{\hbar^2}$. The first term can then be further decomposed into its characters (tokens), that is, the denominator $c$ for the speed of light, the operator $\partial_t$ with an exponent (number) 2, and the identifier $\psi$ for the physical (quantum) wave function.
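The tokenization and tagging step described above could look roughly like the following sketch. It is not the method developed in the dissertation: the regular expression, the tag names, the set `OPERATOR_COMMANDS`, and the LaTeX string used for the Klein-Gordon equation are all assumptions made for illustration.

```python
import re

# Hypothetical tag set: the text only names identifiers, operators, numbers, "etc."
TOKEN_PATTERN = re.compile(
    r"(?P<command>\\[A-Za-z]+)"      # LaTeX commands such as \frac, \partial, \psi
    r"|(?P<number>\d+(?:\.\d+)?)"    # integer or decimal numbers
    r"|(?P<identifier>[A-Za-z])"     # single-letter identifiers such as c, t, m
    r"|(?P<operator>[+\-*/=^_])"     # arithmetic and script operators
    r"|(?P<bracket>[{}()\[\]])"      # grouping symbols
)

# Commands tagged as operators rather than identifiers (illustrative, incomplete).
OPERATOR_COMMANDS = {"\\frac", "\\partial", "\\nabla"}


def tokenize(latex: str) -> list[tuple[str, str]]:
    """Split a LaTeX formula into (tag, token) pairs; whitespace is skipped."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(latex):
        kind = match.lastgroup
        text = match.group()
        if kind == "command":
            # Remaining commands (e.g. \psi, \hbar) are treated as identifiers.
            kind = "operator" if text in OPERATOR_COMMANDS else "identifier"
        tokens.append((kind, text))
    return tokens


# Assumed LaTeX form of the Klein-Gordon equation.
klein_gordon = (
    r"\frac{1}{c^2}\frac{\partial^2}{\partial t^2}\psi"
    r" - \nabla^2\psi + \frac{m^2c^2}{\hbar^2}\psi = 0"
)

for kind, text in tokenize(klein_gordon):
    if kind != "bracket":
        print(f"{kind:10s} {text}")
```

A flat token list like this ignores the syntax tree; the structural information mentioned in the abstract would need a proper parser on top of it.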

When analyzing the semantics of a formula, we are faced with the problem of identifier ambiguity, which requires disambiguation with the help of the partial clarifications available in the text. A single identifier has a theoretically unlimited number of possible meanings: for example, E in physics often refers to either an energy or an electric field, while in mathematics it generally denotes an expected value. Thus, it is essential to improve the retrieval of the semantics from the surrounding text.
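To illustrate what disambiguation from the surrounding text might involve, the following minimal sketch ranks candidate meanings of an identifier by counting cue words in a context sentence. The candidate table, the cue words, and the function `disambiguate` are hypothetical and hand-written for this example; they are not taken from Wikidata or from the method under development.

```python
import re

# Hypothetical candidate meanings for the identifier "E", each with cue words
# that might occur in the surrounding text. Illustrative only.
CANDIDATES = {
    "E": {
        "energy": {"energy", "mass", "joule", "kinetic"},
        "electric field": {"electric", "field", "charge", "voltage"},
        "expected value": {"expected", "expectation", "random", "probability"},
    }
}


def disambiguate(identifier: str, context: str) -> list[tuple[str, int]]:
    """Rank candidate meanings by the number of cue words found in the context."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = [
        (meaning, len(cues & words))
        for meaning, cues in CANDIDATES.get(identifier, {}).items()
    ]
    # Highest overlap first; ties keep the order of the candidate table.
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = disambiguate("E", "where E is the electric field generated by the charge")
print(ranking[0])
```

Simple word overlap is only a stand-in here; any real approach would have to cope with identifiers whose meaning is never stated explicitly in the text.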

Research Questions

The aim of Formula Concept Discovery (FCD) is to 1) retrieve a large number of formula examples from Wikipedia articles and arXiv documents together with a mapping to formula concepts (Wikidata items), and 2) recover a general definition of a formula concept using feature analysis and abstract mathematical formalization. The aim of Formula Concept Recognition (FCR) is to identify formulae in arXiv documents or Wikipedia articles as Wikidata formula concept items. This requires a similarity measure that assigns a formula to a mathematical concept (equation) if the similarity exceeds a defined threshold. A first rough approach could be a matching score = # recognized elements / # total elements. To successfully identify a single element, for example, the Laplace operator $\nabla^2 = \Delta$, it must be assigned to the corresponding concept in Wikidata, at https://www.wikidata.org/wiki/Q203484, i.e., to QID Q203484. The aim is to motivate active users of Wikidata to gradually build a hierarchical structure of the formula elements, assign elements to all available formulae (property "has part"), and create new items for formula concepts that directly include their parts.

Evaluation Plans

I will compare and discuss 1) several possible Formula Concept Discovery methods (e.g., taking the first formula from a Wikipedia article as the defining formula of the concept, formula clustering, etc.), and 2) several possible Formula Concept Recognition methods (e.g., simple TeX string search vs. parts identification, recognition by identifier name, symbol and value, etc.).
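The rough matching score proposed under Research Questions (# recognized elements / # total elements, compared against a threshold) can be sketched as follows. The element labels, the concept table, the threshold of 0.8, and the function names are assumptions chosen for illustration; a real system would presumably compare Wikidata QIDs of the parts rather than free-text labels.

```python
def matching_score(formula_elements: list[str], concept_parts: set[str]) -> float:
    """Return # recognized elements / # total elements for one candidate concept."""
    if not formula_elements:
        return 0.0
    recognized = sum(1 for element in formula_elements if element in concept_parts)
    return recognized / len(formula_elements)


def recognize(formula_elements, concepts, threshold=0.8):
    """Return (concept name, score) if the best score exceeds the threshold,
    otherwise (None, best score). The threshold value is a placeholder."""
    best_name, best_score = None, 0.0
    for name, parts in concepts.items():
        score = matching_score(formula_elements, parts)
        if score > best_score:
            best_name, best_score = name, score
    if best_score > threshold:
        return best_name, best_score
    return None, best_score


# Hypothetical "has part" table for a single formula concept.
concepts = {
    "Klein-Gordon equation": {
        "speed of light", "time derivative", "Laplace operator",
        "wave function", "mass", "reduced Planck constant",
    }
}

# Elements found in a candidate formula; one of them could not be identified.
elements = [
    "speed of light", "time derivative", "Laplace operator",
    "wave function", "mass", "unknown symbol",
]

name, score = recognize(elements, concepts)
print(name, round(score, 3))
```

Because the score only counts recognized parts, it ignores how the parts are arranged; two different formulae built from the same elements would score identically, which is one reason the text calls this a first rough approach.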

Completed Research

In my first publication [SGPS+18], I significantly contributed to the creation of a gold standard, MathMLben, for the evaluation of the conversion between different mathematical formats (LaTeX vs. Computer Algebra Systems). In my second publication [SSD+18], I presented the first math-aware QA system that can answer a natural language question yielding a mathematical formula using Wikidata. My third, most recent publication [SSG18] initiates my reasoning on a definition of a formula concept and its possible content representations in LaTeX, MathML, and Wikidata.

Remaining Research

In my next publication, I will provide a thorough literature review on formula feature analysis. Together with M. Schubotz and A. Greiner-Petter, I am planning to develop an annotation tool, AnnotaTeX, for LaTeX documents that will facilitate the annotation process by recommending identifier names to the user. Figure 1 (right) shows a proposed User Interface.

[SGPS+18] Moritz Schubotz, Andre Greiner-Petter, Philipp Scharpf, Norman Meuschke, Howard Cohl, and Bela Gipp. Improving the representation and conversion of mathematical formulae by considering their textual context. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Fort Worth, USA, Jun. 2018.

[SSD+18] Moritz Schubotz, Philipp Scharpf, Kaushal Dudhat, Yash Nagar, Felix Hamborg, and Bela Gipp. Introducing MathQA - a math-aware question answering system. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Workshop on Knowledge Discovery, Fort Worth, USA, Jun. 2018.

[SSG18] Philipp Scharpf, Moritz Schubotz, and Bela Gipp. Representing mathematical formulae in Content MathML using Wikidata. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2018.