Towards Formula Concept Discovery and Recognition

Philipp Scharpf¹, Moritz Schubotz², Howard S. Cohl³, and Bela Gipp²

¹ Department of Computer and Information Science, University of Konstanz, Germany
² Department of Information Technology, University of Wuppertal, Germany
³ National Institute of Standards and Technology, United States

Abstract. Citation-based Information Retrieval (IR) methods for scientific documents have proven to be effective in academic disciplines that use many references. In science, technology, engineering, and mathematics (STEM), researchers cite less often but employ mathematical concepts to refer to prior knowledge (Moed et al.). Our long-term goal is to generalize citation-based IR methods and apply the generalized method to both classical references and mathematical concepts. In this paper, we suggest how mathematical formulae could be cited and define a Formula Concept Retrieval challenge with two subtasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). While the former aims at the definition and exploration of a Formula Concept that names bundled equivalent representations of a formula, the latter is designed to match a given formula to a previously assigned concept ID. Moreover, we present first machine-learning-based approaches to tackle the FCD and FCR tasks, which we apply to a standardized test collection (NTCIR arXiv dataset). Our FCD approach yields a recall of 68% for retrieving equivalent representations of frequent formulae, and 72% for extracting the formula name from the surrounding text. FCD and FCR will enable citing formulae within mathematical documents and facilitate semantic search as well as similarity computations for plagiarism detection or document recommender systems.

Keywords: Natural Language Processing · Mathematical Language Processing · Mathematical Information Retrieval · Feature Analysis · Machine Learning

1 Introduction

Documents from Science, Technology, Engineering, and Mathematics (STEM) often contain a significant amount of mathematical formulae. Since they are vital to understanding the content of these documents, semantic search engines or recommender systems need to process and analyze them alongside the text. In information science and technology, the semantics of natural language is typically grasped via conceptualization [25]. In the case of mathematical language, we argue for the introduction of a definition for a mathematical Formula Concept as a collection of equivalent formulae with different representations (see [15] for a discussion of the definition difficulties). Once defined, a Formula Concept can be implemented technically through Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR). The first term (FCD) refers to the exploration of formula concepts by examining a multitude of formula examples from various sources and occurrences. The second term (FCR) refers to matching a given formula to a previously assigned concept ID. Figure 1 illustrates how the same equation, in this case the Klein-Gordon equation from Quantum Physics, can be represented in different formats that seem very diverse at first glance but actually represent the same mathematical concept. We will present first implementations of FCD and FCR in the following.

Fig. 1: Various representations of the Klein-Gordon equation extracted from physics papers [2], [22], [7], [21], [6], [12], [11], [4], [20].

2 Related Work

Mathematical Information Retrieval (MathIR) addresses the information need in STEM fields by retrieving, processing, and analyzing mathematical formulae. So far, various formula search engines have been developed, and translations between different markups (LaTeX, Presentation MathML, and Content MathML) and standards have been elaborated [5].
Since Wikipedia is only semi-structured, Wikidata was launched to provide direct access to specific interlingual facts (RDF triples) and retrieve information systematically. Wikidata is a free and open semantic knowledge base that can be read and edited by humans and machines [23]. Wikidata stores items with statements and their references. In the case of mathematical knowledge, this includes formulae, e.g., pressure (Q39552) with a defining formula property (P2534) $p = \frac{F}{S}$. To scalably seed information into Wikidata, a Primary Sources tool was introduced, allowing active users to quickly browse through new claims and their references to approve or reject them.

The arXiv e-Print server [10] makes available free preprints for a large collection of publications from Physics, Mathematics, Computer Science, Economics, and more. Many authors provide their LaTeX source code. Both Wikipedia and arXiv articles were extracted as part of the NTCIR MathIR Task [1]. In 2017, the Special Interest Group for Math Linguistics (SIGMathLing) was initiated as a forum and resource cooperative for the linguistics of mathematical/technical documents.

For Mathematical Language Processing (MLP), the formula parts (operators, identifiers, numbers) have to be annotated using the Mathematical Markup Language (MathML). There are several tools available, most prominently the LaTeXML converter. Furthermore, the occurring symbols (variables, constants) need to be disambiguated, i.e., their meaning has to be inferred from the context and semantically annotated. There have been attempts to automatically retrieve the semantics of identifiers from the surrounding text [18]. While Wikipedia articles commonly contain variable definitions in the text, many research papers omit them. This makes manual annotation inevitable for building machine-interpretable datasets.
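To make the annotation of formula parts concrete, the following minimal sketch collects identifiers, operators, and numbers from a Presentation MathML string by reading its `<mi>`, `<mo>`, and `<mn>` token elements. The function name `formula_parts` and the hand-written example markup (the pressure formula, force over surface) are illustrative assumptions and not output of any of the tools mentioned above.

```python
import xml.etree.ElementTree as ET

# MathML token elements mapped to the formula parts named in the text.
TAG_TO_PART = {"mi": "identifiers", "mo": "operators", "mn": "numbers"}


def formula_parts(mathml):
    """Collect identifiers, operators, and numbers from a MathML string.

    Hypothetical helper for illustration; elements are visited in
    document order and XML namespaces are ignored.
    """
    parts = {"identifiers": [], "operators": [], "numbers": []}
    for elem in ET.fromstring(mathml).iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
        part = TAG_TO_PART.get(local_name)
        if part and elem.text:
            parts[part].append(elem.text.strip())
    return parts


# Hand-written Presentation MathML for p = F/S (assumed example markup).
example = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow><mi>p</mi><mo>=</mo><mfrac><mi>F</mi><mi>S</mi></mfrac></mrow>
</math>"""

print(formula_parts(example))
```

Such a listing only separates the parts; it does not disambiguate them. Deciding that `p` denotes pressure still requires the surrounding text or manual annotation, as discussed above.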
The NIST Digital Repository of Mathematical Formulae (DRMF) [3] and the NIST Digital Library of Mathematical Functions (DLMF) [9] are two examples of maintained high-quality semantic datasets. At this moment, Wikidata contains approximately 3600 items with a "defining formula" property. Moreover, the benchmark MathMLben [17] was created to evaluate tools for mathematical format conversion (from LaTeX to MathML to Computer Algebra Systems), containing approximately 300 formulae from Wikipedia, arXiv, and the DLMF, which were augmented by Wikidata macros [16].

3 Formula Concept Retrieval Challenge

Our eventual goal is to map all of the various representations of a formula to a unique and open concept ID, e.g., linking all occurrences of the Klein-Gordon equation shown in Figure 1 to the Wikidata item Q868967. We define two subtasks of the Formula Concept Retrieval challenge:

– Formula Concept Discovery (FCD) as a method to find common equivalent representations and a name candidate for a given formula, and
– Formula Concept Recognition (FCR) as the approach to recognize formulae in documents as being instances of a previously defined Formula Concept.

4 Our Approach

In the following, we present our first efforts to implement and evaluate Formula Concept Discovery (FCD). We approach FCD by retrieving equivalent formulations with different representations (see Figure 2) as well as name candidates from the surrounding text. The initial step is to identify formula candidates that occur most often within a given dataset, assuming that they are potential seeds of popular formula concepts. We first tried formula clustering but discovered that it was not a suitable method for FCD, since the number of clusters is unclear a priori and the tested algorithms were not able to group equivalent formulae.
Subsequently, we decided to start with a ranking of formula duplicates (with the same LaTeX string), which yielded reasonable results. We employed the NTCIR arXiv dataset [1], which comprises 104,062 document sections containing over 60 million formulae. We confined our computations to the subject class of astrophysics (680 astro-ph documents), employing a domain expert to semantically evaluate the results. From the duplicate ranking, we selected formulae with a length between 10 and 30 characters and restricted our selection to duplicates occurring in at least two different documents. This yielded 3495 formulae. We then manually selected all equations and discarded all stubs without a right-hand side, as well as simple variable dependence definitions such as $x = x(t)$, $x = y$, or $x = \mathrm{const}$.

For the first 50 samples from the duplicate ranking, we retrieved the operators and identifiers from the provided MathML <mo> and <mi> tags, as well as the surrounding text (words within a window of ±500 characters around the formula). We encoded both the formula parts and the surrounding text using the TfidfVectorizer from the Python package Scikit-learn [13] and the Doc2Vec model [8] from the Python package Gensim [14]. We then compared the performance of a k-nearest neighbor classifier (Scikit-learn) on the four resulting vector encodings (math2vec [24] and math tf-idf for the formulae, semantics2vec and semantics tf-idf for the surrounding text) to retrieve equivalent representations.

5 Our Results

Table 1 shows the results of our approach for discovering Formula Concepts. We rank the fetched formulae by the number of duplicates $d$ and also list the number of documents $\hat{d}$ they appear in. The main investigation was to compare the performance of four different encodings in terms of the number of equivalent representations retrieved using the kNN recommendation algorithm provided by Scikit-learn.
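The duplicate ranking of Section 4 and the nearest-neighbor retrieval compared here can be sketched in a few lines. The block below is a minimal illustration under stated assumptions, not the implementation behind Table 1: it uses only the Python standard library, a hand-rolled tf-idf with a smoothed idf and cosine similarity stand in for Scikit-learn's TfidfVectorizer and kNN, the length filter is assumed to be inclusive, and all function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def rank_duplicates(formulae, min_len=10, max_len=30, min_docs=2):
    """Rank identical LaTeX strings by their number of duplicates.

    `formulae` holds (document_id, latex_string) pairs. Returns
    (latex, d, d_hat) tuples: d counts the duplicates, d_hat the
    distinct documents in which the string occurs.
    """
    counts = Counter()
    documents = defaultdict(set)
    for doc_id, latex in formulae:
        if min_len <= len(latex) <= max_len:  # length filter, assumed inclusive
            counts[latex] += 1
            documents[latex].add(doc_id)
    return [
        (latex, d, len(documents[latex]))
        for latex, d in counts.most_common()
        if len(documents[latex]) >= min_docs
    ]


def tfidf_vectors(token_lists):
    """Encode each token list as a sparse tf-idf vector (token -> weight)."""
    n = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        term_freq = Counter(toks)
        vectors.append({
            tok: tf * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, tf in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(query_index, vectors, k=1):
    """Indices of the k vectors most similar to the query vector."""
    scored = [
        (cosine(vectors[query_index], vec), idx)
        for idx, vec in enumerate(vectors)
        if idx != query_index
    ]
    return [idx for _, idx in sorted(scored, reverse=True)[:k]]


# Toy corpus (assumed data): only the first string passes both filters.
corpus = [
    ("doc1", r"H = \dot{a}/a"),
    ("doc2", r"H = \dot{a}/a"),
    ("doc2", r"p = \omega\rho"),
    ("doc2", r"p = \omega\rho"),
    ("doc3", "x = y"),
]
ranking = rank_duplicates(corpus)

# Identifier and operator tokens of four toy formulae (assumed data).
formula_tokens = [
    ["H", "=", "a", "/", "a"],
    ["H", "(", "t", ")", "=", "a", "/", "a"],
    ["p", "=", "omega", "rho"],
    ["omega", "=", "p", "/", "rho"],
]
vectors = tfidf_vectors(formula_tokens)
print(ranking)
print(nearest_neighbors(0, vectors), nearest_neighbors(2, vectors))
```

In this toy setting the two Hubble-parameter variants and the two equation-of-state variants retrieve each other because they share rare tokens. The same retrieval step can be run on tokens from the surrounding text instead of the formula, which corresponds to the "semantics" encodings compared in Table 1.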
Calculating the overall success distribution, we discovered that the math2vec ($e_m$) encoding clearly outperforms the others by yielding 71% of the retrieved instances, followed by semantics tf-idf ($\hat{e}_s$) with 15%, semantics2vec ($e_s$) with 11%, and math tf-idf ($\hat{e}_m$) with 4%. On average, there were 3 matches per formula from 3 different documents. Overall, for 34/50 = 68% of the sample formulae, we could retrieve equivalent representations. Finally, we listed the five top name candidates from the surrounding text and evaluated whether they contain a suitable name for the Formula Concept to be seeded as a Wikidata item. For our 50 examples, we achieve a recall of 36/50 = 72% for the formula name. Furthermore, for 41/50 = 82% of the retrieved name candidates, there was a Wikidata QID available to tag the formula concept.

6 Future Work

Having launched FCD as a method for tagging formulae with Wikidata QIDs, we can now employ FCR to identify formulae within STEM documents using their constituting parts (operators and identifiers) in a SPARQL query¹⁰. However, since at the moment fewer than 4000 formulae are seeded into Wikidata [19] and storing multiple representations as "defining formula" of the same formula concept item is not endorsed, we argue for the creation of a specific Wikidata-attached Formula Concept Database. It should include formalized augmentation to generate equivalent forms using, e.g., commutations, additional sub- and superscripts, unit and reference frame variations, etc. Most importantly, a method for inferring substitutions or implicit terms needs to be developed.

[Figure 2 shows two clusters. Hubble's law (Q179916): $H = \dot{a}/a$, $\dot{a} = aH$, $H_i = \dot{R}/R$, $H(t) = \dot{a}/a$. Equation of state (Q214967): $p = \omega\rho$, $p = \kappa\rho$, $\omega = p/\rho$, $p_d = \omega\rho_d$.]

Fig. 2: Clustering equivalent representations of formulae in the semantic space as named Formula Concept Wikidata items.

This work was supported by the German Research Foundation (DFG grant GI-1259-1).
¹⁰ W3C Recommendation:

Table 1: Formula Concept Discovery (FCD). Top-50 results of a cross-document duplicate search in the subject class astro-ph of the NTCIR arXiv dataset. Equivalent formulae are retrieved to bundle concept candidates using a k-nearest neighbor (kNN) recommendation, while comparing the relative success $s$ of different encodings (math2vec: $e_m$, math tf-idf: $\hat{e}_m$, semantics2vec: $e_s$, semantics tf-idf: $\hat{e}_s$). The number of duplicates $d$ and originating distinct documents $\hat{d}$ are shown as well as a retrieved sample formula. Furthermore, it is evaluated whether the first five words of the surrounding text are candidates for the name of the formula and whether a Wikidata QID is available.

| # | Formula | Name (QID) | $d$ / $\hat{d}$ | $s_{e_m}, s_{\hat{e}_m}, s_{e_s}, s_{\hat{e}_s}$ | Encoding: sample formula | Name candidates from surrounding text |
| 1 | $H = \dot{a}/a$ | hubble parameter (Q179916) | 32 / 32 | 0.0, 0.1, 0.0, 0.9 | $\hat{e}_s$: $H_i = \dot{R}/R$ | hubble, parameter, time, factor, equations |
| 2 | $p = \omega\rho$ | equation of state (Q214967) | 6 / 5 | 0.3, 0.0, 0.1, 0.6 | $e_s$: $p_d = w\rho_d$ | equation, state, quintessence, expansion, pressure |
| 3 | $\omega = p/\rho$ | accelerating universe (Q1049613) | 4 / 3 | 0.7, 0.0, 0.0, 0.3 | $e_m$: $p = \omega\rho$ | universe, accelerating, indefinitely, strain, values |
| 4 | $p = -A/\rho^{\alpha}$ | dark fluid (Q5223514) | 4 / 4 | 0.7, 0.0, 0.3, 0.0 | $e_m$: $p = -\frac{A}{\rho^{\alpha}}$ | chaplygin, gas, dark, generalized, fluid |
| 5 | $p_d = w\rho_d$ | dark energy (Q18343) | 4 / 3 | 0.3, 0.0, 0.3, 0.3 | $e_s$: $p_X = \omega_X \rho_X$ | energy, dark, equation, represent, pressure |
| 6 | $H = \dot{a}/a$ | N/A (Q179916) | 4 / 4 | 0.4, 0.1, 0.2, 0.3 | $\hat{e}_m$: $H = a'/a$ | scale, factor, usual, equation, state |
| 7 | $k = \lvert k \rvert$ | wavenumber (Q192510) | 3 / 3 | 0.8, 0.0, 0.2, 0.0 | $e_m$: $k = \lvert k \rvert$ | oscillatory, behavior, depend, time, wavenumber |
| 8 | $f = e^{-\phi} R$ | N/A (N/A) | 3 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $f(\phi) = e^{-\phi} R$ | string, lowenergy, effective, action, theory |
| 9 | $p = \kappa\rho$ | equation of state (Q214967) | 3 / 2 | 0.3, 0.0, 0.7, 0.0 | $e_s$: $p_D = w(z)\rho_D$ | equation, state, ary, patch, exceeds |
| 10 | $w = p_X/\rho_X$ | equation of state (Q214967) | 3 / 3 | 0.6, 0.0, 0.1, 0.3 | $e_m$: $p_X = w_X \rho_X$ | equation, state, dark, energy, wmap |
| 11 | $\mu = m_p/m_e$ | proton-to-electron mass ratio (Q2912520) | 3 / 3 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $m_i = \mu m_p$ | ratio, proton, electron, masses, technique |
| 12 | $\phi_c = M/g$ | critical value (Q2189464) | 3 / 3 | 0.0, 0.0, 0.0, 0.0 | N/A | field, critical, value, takes |
| 13 | $p = -\frac{A}{\rho^{\alpha}}$ | chaplygin gas (Q5073250) | 3 / 3 | 0.8, 0.0, 0.0, 0.2 | $e_m$: $p = -A\rho^{-\alpha}$ | state, generalized, chaplygin, gas, equation |
| 14 | $p = \alpha\rho$ | polytropic gas (Q831024) | 3 / 2 | 0.7, 0.0, 0.2, 0.2 | $\hat{e}_s$: $w_\alpha = p_\alpha/\rho_\alpha$ | constant, gas, cosmological, matter, polytropic |
| 15 | $M = \widetilde{M}/\Gamma$ | connected manifold (Q2721559) | 3 / 3 | 0.0, 0.0, 0.0, 0.0 | N/A | multiply, connected, equally, quotient, manifolds |
| 16 | $g(a) = \triangle(a)/a$ | dark energy (Q18343) | 3 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $g(a) = \Delta(a)/a$ | models, dark, energy, growth, history |
| 17 | $\alpha = dn_s/d\ln k$ | N/A (Q192510) | 3 / 3 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $dn_s/d\ln k = \alpha_s$ | introduced, customary, notation, comoving, wavenumber |
| 18 | $\psi = -i\theta$ | N/A (N/A) | 3 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | real, imaginary, universe |
| 19 | $dt = a(\eta)\,d\eta$ | N/A (Q11471) | 2 / 2 | 0.5, 0.0, 0.3, 0.3 | $\hat{e}_s$: $t = \int a(\eta)\,d\eta$ | time, related, cosmic, relation, overdot |
| 20 | $\Delta x_{\min} = \sqrt{\beta}$ | lower bound (Q21067468) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $\Delta x_{\min} = \hbar\sqrt{\beta}$ | positive, constant, lower, bound, implies, dimensional |
| 21 | $k^i = a p^i$ | modes (N/A) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | modes, comoving, obtained, scaling, coincide |
| 22 | $\varphi = \delta A_\mu$ | perturbations (Q911364) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | note, valid, perturbations, gauge, theories |
| 23 | $h_{ab} = g_{ab} - n_a n_b$ | metric (Q865746) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | bulk, scalar, curvature, induced, metric |
| 24 | $K = K_{ab} h^{ab}$ | brane (Q385601) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $K = K_{\alpha\beta} h^{\alpha\beta}$ | vector, field, unit, normal, brane |
| 25 | $v = \sqrt{\lvert dp/d\rho \rvert}$ | equation of state (Q214967) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $v_c = \sqrt{dp_c/d\rho_c}$ | equation, state, suggests, effective, velocity |
| 26 | $Q = \sqrt{G} M$ | limit (Q246639) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | limit, rhoades, value, write |
| 27 | $\zeta = H\delta\phi/\dot{\phi}$ | N/A (Q10886678) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $R = (H/\dot{\phi})\delta\phi_\psi$ | curvature, perturbation, uniform, density, valid |
| 28 | $m_\gamma = e/\sqrt{\pi}$ | photon mass (Q3198) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | photon, mass, gauge, mechanism, schwinger |
| 29 | $d\eta = dt/a(t)$ | conformal time (Q2482717) | 2 / 2 | 0.6, 0.0, 0.1, 0.3 | $\hat{e}_s$: $t = \int a(\eta)\,d\eta$ | conformal, time, ase, figure, fig |
| 30 | $T_g = H_o t_g$ | N/A (Q126818) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | dimensionless, factor, eq, extragalactic, object |
| 31 | $H = a'/a$ | N/A (Q179916) | 2 / 2 | 0.7, 0.0, 0.1, 0.2 | $\hat{e}_s$: $H = \dot{a}/a$ | conformal, time, background, scale, factor |
| 32 | $\theta = A\exp(-\zeta t)$ | exponential decrease (Q574576) | 2 / 2 | 0.0, 1.0, 0.0, 0.0 | $\hat{e}_m$: $\psi(t, r) = \psi(r)\exp(-i\omega t)$ | decreases, exponentially, slowly |
| 33 | $p_i = \omega_i \rho_i$ | N/A (N/A) | 2 / 2 | 0.7, 0.0, 0.1, 0.1 | $e_s$: $w_X = p_X/\rho_X$ | case, expected, current, observations, restrict |
| 34 | $i\partial_t \Phi = H\Phi$ | schrödinger evolution (Q165498) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | evolution, shrödinger |
| 35 | $H(t) = \dot{a}/a$ | N/A (Q179916) | 2 / 2 | 0.8, 0.1, 0.0, 0.1 | $e_m$: $\dot{a} = aH$ | data, scale, function, combined, sn |
| 36 | $p_\Lambda = -\rho_\Lambda$ | dark energy (Q18343) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $p_D = -\rho_D$ | dark, contributions, matter, energy, matterdominated |
| 37 | $P_M = w\rho_M$ | equation of state (Q214967) | 2 / 2 | 0.6, 0.0, 0.3, 0.1 | $e_s$: $p_x = w\rho_x$ | pressure, write, related, equation, state |
| 38 | $f_\nu = \rho_\nu/\rho_d$ | neutrino (Q2126) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | matter, neutrino |
| 39 | $A_t = rA_s$ | fluctuation (Q5462624) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | fluctuation |
| 40 | $p_m = \gamma\rho_m$ | nonrelativistic matter (Q55921784) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $\gamma = p/\rho$ | matter, components, universe, nonrelativistic, ordinary |
| 41 | $\Omega_i = \rho_i/\rho_c$ | expansion rate (N/A) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $\Omega = \rho/\rho_{crit}$ | universe, constant, rate, expansion, variables |
| 42 | $P(k) = Ak^n$ | inflation (Q273508) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | fluctuations, field, inflation, universe, inflationary |
| 43 | $L_I = M(\tau)\phi[x(\tau)]$ | N/A (N/A) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | idea, quantitative, viewpoint, arises, study |
| 44 | $L = \kappa h_{ab} T^{ab}$ | N/A (N/A) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | standard, coupling |
| 45 | $w_i = P_i/\rho_i$ | equation of state (Q214967) | 2 / 2 | 0.7, 0.0, 0.2, 0.1 | $\hat{e}_s$: $w_\alpha = p_\alpha/\rho_\alpha$ | relative, contributions, components, equations, state |
| 46 | $\bar{M} = B/C$ | N/A (N/A) | 2 / 2 | 0.3, 0.0, 0.3, 0.3 | $e_s$: $\bar{M} = \frac{B}{C}$ | minimum |
| 47 | $\Psi = \Psi_\ell + \Psi_s$ | N/A (N/A) | 2 / 2 | 0.0, 0.0, 0.0, 0.0 | N/A | split, dropped, note, long, short |
| 48 | $z = a\dot{\phi}/H$ | equation (Q21086835) | 2 / 2 | 0.7, 0.0, 0.0, 0.3 | $\hat{e}_s$: $z_q = a\dot{\phi}/H$ | quantity, equation |
| 49 | $u^\mu = dx^\mu/d\tau$ | comoving fluid (Q5462744) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $k^\mu = dx^\mu/dv$ | cosmological, fundamental, observer, comoving, fluid |
| 50 | $\dot{\phi} = -W_\phi$ | firstorder differential equation (Q11214) | 2 / 2 | 1.0, 0.0, 0.0, 0.0 | $e_m$: $\dot{\chi} = -W_\chi$ | equation, firstorder, differential, scale, factor |

References

1. 