=Paper=
{{Paper
|id=Vol-2634/DP3
|storemode=property
|title=Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language
|pdfUrl=https://ceur-ws.org/Vol-2634/DP3.pdf
|volume=Vol-2634
|authors=Takuto Asakura
|dblpUrl=https://dblp.org/rec/conf/mkm/Asakura19
}}
==Understanding Scientific Documents with Synthetic Analysis on Mathematical Expressions and Natural Language==
Takuto Asakura
Department of Informatics, SOKENDAI

1 Introduction

Converting Science, Technology, Engineering, and Mathematics (STEM) documents into formal expressions would have a large impact on both academia and industry. It would enable us to construct databases of mathematical knowledge, search for formulae, and build systems that automatically generate executable code. However, the conversion is an exceedingly ambitious goal. Mathematical expressions are commonly used in scientific communication in numerous fields such as mathematics and physics, and in many cases they express the key ideas of STEM documents. Formulae and texts are complementary to each other, and neither can be understood in isolation from the other. Thus, a deep, synthetic analysis of natural language and mathematical expressions together is necessary.

To date, a large number of efforts have been made to develop Natural Language Processing (NLP) techniques, including semantic parsing [4], but their targets are mostly 'general' texts. Consequently, conventional NLP techniques offer only limited support for formulae and for the numerous linguistic phenomena specific to STEM documents [3]. Meanwhile, the semantics of mathematical expressions has also been investigated in depth; such results can be found in logical theories, the MathML specification [1], and elsewhere. However, there remains a large gap between formal representations such as first-order logic and the actual formulae embedded in natural language texts.

2 Research Goals

Much work remains to achieve the conversion of STEM documents into a computational form (Figure 1). We will first focus on two foundational parts of the synthetic analyses. The first is token-level analysis of formulae.
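To make the ultimate goal concrete, here is a minimal sketch (not the author's system) of the final step: once a formula has been reduced to a formal arithmetic expression, turning it into executable code is straightforward. The function name and the restriction to plain arithmetic are our own illustrative choices; real STEM formulae require the full synthetic analysis described below before any such step is possible.

```python
# Toy sketch: compile an already-formalized arithmetic formula into a
# callable Python function.  Only plain arithmetic on declared variables
# is accepted; everything else is rejected.
import ast

def formula_to_function(formula: str, variables: list[str]):
    """Compile a formula such as 'm * c**2' into a callable."""
    tree = ast.parse(formula, mode="eval")
    # Reject anything but arithmetic on the declared variables.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in variables:
            raise ValueError(f"unknown symbol: {node.id}")
        if isinstance(node, (ast.Call, ast.Attribute)):
            raise ValueError("only plain arithmetic is allowed")
    code = compile(tree, "<formula>", "eval")
    return lambda **env: eval(code, {"__builtins__": {}}, env)

energy = formula_to_function("m * c**2", ["m", "c"])
print(energy(m=2.0, c=3.0))  # 18.0
```

The hard part, of course, is everything this sketch takes for granted: recovering the formal expression and the meaning of each symbol from the document in the first place.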
The main part of this analysis is associating formula tokens with mathematical objects and text fragments (Section 2.1). This is a primary step toward the conversion, but it remains almost untouched. The second is the morphology of mathematical expressions and a semantics covering both formulae and texts (Section 2.2). Studying these underlying theories is essential for a deep understanding of the structure of STEM documents, and for working toward practical applications in a bottom-up fashion. In short, we will first tackle the token-level analyses of mathematical expressions (Section 2.1) and then the theories covering both formulae and texts (Section 2.2).

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: C. Kaliszyk, E. Brady, J. Davenport, W.M. Farmer, A. Kohlhase, M. Kohlhase, D. Müller, K. Pąk, and C. Sacerdoti Coen (eds.): Joint Proceedings of the FMM and LML Workshops, Doctoral Program and Work in Progress at the Conference on Intelligent Computer Mathematics 2019 co-located with the 12th Conference on Intelligent Computer Mathematics (CICM 2019), Prague, Czech Republic, July 8–12, 2019, published at http://ceur-ws.org

[Figure 1: Overview of our task definitions. The diagram relates formula analyses (token-level) over mathematical expressions and sentence analyses (word-level) over natural language, grounded in a shared morphology & semantics and leading up to applications.]

2.1 Associating Tokens in Formulae with Mathematical Objects and Their Descriptions in Texts

Tokens in formulae (e.g., x, ε, ×, log) and their combinations can refer to mathematical objects. We human beings can detect what each token or combination points to by using common sense, domain knowledge, and the descriptions given in the same document or in others. This detection is fundamental and should be one of the initial steps of understanding STEM documents, but unfortunately it cannot easily be done by a machine. There are at least four factors that make the detection highly challenging: (1) ambiguity of tokens, (2) syntactic ambiguity of formulae, (3) the necessity of common sense and domain knowledge, and (4) severe abbreviation. These difficulties appear frequently in formulae. As a representative example of (1): in the first chapter alone of the book Pattern Recognition and Machine Learning (PRML) [2], the character y (the letter 'y' in bold roman) is used with several meanings, including a function, vectors, and a value (Table 1).

Table 1: Usage of the character y in the first chapter of PRML (except exercises). Underlines by the author.

  Text fragment from PRML Chap. 1                            | Meaning of y
  "... can be expressed as a function y(x) which takes ..."  | a function which takes an image as input
  "... an output vector y, encoded in ..."                   | an output vector of function y(x)
  "... two vectors of random variables x and y ..."          | a vector of random variables
  "Suppose we have a joint distribution p(x, y) from ..."    | a part of pairs of values, corresponding to x

The other initial step of understanding STEM documents is connecting text fragments to the corresponding mathematical objects. Our hypothesis is that, for this step, general NLP approaches such as dependency parsing are more or less applicable. Of course, some tuning for STEM documents will be required, and this process might need to be carried out interactively with the detection of mathematical objects in formulae.

2.2 Semantics and Morphology

The semantics of natural language and of mathematical expressions have been studied separately. However, to understand STEM documents, it is important to investigate a synthetic semantics covering both texts and formulae. Likewise, morphology has been studied extensively for natural languages, but far less so for formulae. As a matter of fact, words also exist in formulae in the morphological sense. For instance, the token M is a word in "Matrix M", but M is not a word in "An entry Mi,j" (Mi,j as a whole is the word).
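The ambiguity illustrated in Table 1 can be sketched in code. The following toy disambiguator is purely illustrative (the cue words are our own invention, not PRML's or the paper's annotation scheme); it shows why resolving a token such as y requires its surrounding text fragment, exactly the association task described in Section 2.1.

```python
# Minimal sketch of context-based token disambiguation.  The same token
# 'y' resolves to different mathematical objects depending on the
# surrounding text fragment.  Cue words are invented for illustration.
CUES = {
    "function": ["function", "takes", "maps"],
    "vector":   ["vector", "vectors"],
    "value":    ["value", "values", "distribution"],
}

def resolve(token: str, context: str) -> str:
    """Guess the object class of `token` from words in its context."""
    words = context.lower().split()
    for category, cues in CUES.items():
        if any(cue in words for cue in cues):
            return category
    return "unknown"

print(resolve("y", "can be expressed as a function y(x) which takes"))  # function
print(resolve("y", "two vectors of random variables x and y"))          # vector
print(resolve("y", "Suppose we have a joint distribution p(x, y)"))     # value
```

A real system would of course need domain knowledge and common sense beyond surface cue words, which is precisely factors (3) and (4) above.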
Unlike morphemes in natural language, tokens in formulae do not have lexical categories, but some symbols (e.g., parentheses and the equals sign) and positional information (e.g., superscripts and subscripts) have typical usages.

3 Completed and Remaining Research

As the first stage of our research, we simplified the detection task described in Section 2.1. Specifically, we are annotating research papers in the following manner:

1. Detecting minimal groups of tokens (we call them chunks), each of which refers to a mathematical object (chunking).
2. Categorizing chunks by the mathematical object they refer to.

This annotation (we call it the pilot annotation) is the fundamental process for creating the first gold dataset for associating tokens with mathematical objects. The annotated data will also be helpful for investigating the morphology of mathematical expressions. In other words, we defined a classification task before annotating descriptions for tokens. Since there are many ways to describe a mathematical object, the classification can be done more coherently through the pilot annotation. Moreover, we expect that, as a first attempt, the classification will be easier to automate than generating descriptions.

Beyond the pilot annotation, all the work toward our overall goal remains. As the next step, we plan to automate the annotation process using features such as appositive nouns and syntactic information from formulae. At the same time, we have to decide on a representation for mathematical objects. For now, we can say that every mathematical object should have a description and some attributes such as a type (e.g., int or float). Which attributes are necessary and sufficient is still unclear; we will find out after annotating several documents.
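The chunking step above can be sketched as follows. This is a minimal illustration under our own assumptions (the token format and the rules are invented for this example, not the paper's annotation scheme): each token carries a script level, and a chunk is a baseline token together with any subscript/superscript tokens attached to it, so that Mi,j is grouped into a single unit, matching the morphological observation in Section 2.2.

```python
# Illustrative sketch of chunking: group (symbol, level) pairs into
# chunks, where level 0 is the baseline and level 1 marks sub/superscript
# tokens attached to the preceding baseline symbol, e.g. M i , j -> M_{i,j}.

def chunk(tokens):
    """Group (symbol, level) pairs into chunks referring to one object."""
    chunks, current = [], []
    for symbol, level in tokens:
        if level == 0 and current:   # a new baseline symbol closes the chunk
            chunks.append(current)
            current = []
        current.append(symbol)
    if current:
        chunks.append(current)
    return ["".join(c) if len(c) == 1 else c[0] + "_{" + "".join(c[1:]) + "}"
            for c in chunks]

tokens = [("M", 0), ("i", 1), (",", 1), ("j", 1), ("+", 0), ("N", 0)]
print(chunk(tokens))  # ['M_{i,j}', '+', 'N']
```

Real formulae need far richer rules (nested scripts, operators that bind arguments, multi-letter identifiers), which is exactly what the pilot annotation is meant to pin down before automation.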
4 Publication Plans and Evaluation Plans

Currently, we are creating a new language resource through the pilot annotation, and we plan to publish it for the language-resources community. Further in the future, we will develop automation algorithms for mathematical object detection, work well suited to the NLP and digital mathematical library communities, including CICM. The analyses of the underlying morphology and semantics are closer to work in computational linguistics. For the initial dataset, it is preferable to reach agreement among a few experts if possible. Subsequent progress on the algorithms and on the analyses of linguistic phenomena should be evaluated against our handmade gold datasets.

References

[1] Ron Ausbrooks et al. Mathematical Markup Language (MathML) 3.0 Specification. World Wide Web Consortium (W3C), 2014.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Michael Kohlhase and Mihnea Iancu. "Co-Representing Structure and Meaning of Mathematical Documents". In: Sprache und Datenverarbeitung, International Journal for Language Data Processing 38.2 (2014).
[4] Siva Reddy et al. "Transforming Dependency Structures to Logical Forms for Semantic Parsing". In: Transactions of the Association for Computational Linguistics (2016).