=Paper=
{{Paper
|id=Vol-2982/paper-1
|storemode=property
|title=Mathematics in Wikidata
|pdfUrl=https://ceur-ws.org/Vol-2982/paper-1.pdf
|volume=Vol-2982
|authors=Philipp Scharpf,Moritz Schubotz,Bela Gipp
|dblpUrl=https://dblp.org/rec/conf/semweb/ScharpfSG21
}}
==Mathematics in Wikidata==
Philipp Scharpf<sup>1</sup>, Moritz Schubotz<sup>2,3</sup>, and Bela Gipp<sup>3</sup>

<sup>1</sup> University of Konstanz, Germany, philipp.scharpf@uni-konstanz.de

<sup>2</sup> FIZ Karlsruhe, Germany, moritz.schubotz@fiz-karlsruhe.de

<sup>3</sup> University of Wuppertal, Germany, gipp@uni-wuppertal.de

'''Abstract.''' Documents from Science, Technology, Engineering, and Mathematics (STEM) disciplines usually contain a significant amount of mathematical formulae alongside text. Some Mathematical Information Retrieval (MathIR) systems, e.g., Mathematical Question Answering (MathQA), exploit knowledge from Wikidata. Therefore, the mathematical information needs to be stored in Wikidata items. In recent years, there have been efforts to define several properties and to seed formulae, together with their constituent identifiers, into Wikidata. This paper summarizes the current state, challenges, and discussions related to this endeavor. Furthermore, some data mining methods (supervised formula annotation and concept retrieval) and applications (question answering and classification explainability) of the mathematical information are outlined. Finally, we discuss community feedback and issues related to integrating Mathematical Entity Linking (MathEL) into Wikidata and Wikipedia, which was rejected in 33% and 12% of the test cases for Wikidata and Wikipedia, respectively. Our long-term goal is to populate Wikidata such that it can serve a variety of automated math reasoning tasks and AI systems.

'''Keywords:''' Wikidata · Mathematical Information Retrieval · Mathematical Entity Linking · Mathematical Question Answering

===1 Introduction===
Mathematical Information Retrieval (MathIR) systems, such as Document Recommender (DocRec), Mathematical Question Answering (MathQA), and Automatic Document Classification (ADC), need to process and query mathematical formulae.
Since Wikidata has proven useful as a semantic grounding database for Natural Language Processing (NLP) approaches and applications, it was a logical next step to transfer and adapt classical IR and NLP methods to the special case of mathematical knowledge. In 2016, we implemented support for mathematical properties, such as ‘defining formula’ (P2534), which were proposed<sup>4</sup> and used<sup>5</sup>. Later, additional properties to include the semantics of the formula identifiers were added<sup>6</sup>.

Our long-term goal is to build math.wikipedia.org, a large collaborative, semi-formal, machine-readable, language-independent mathematics encyclopedia. Its purpose will be to provide the backbone for automated reasoning tasks, concept entity linking, knowledge-graph population, question answering, and more.

{| class="wikitable"
|+ Table 1. Different mathematical Wikidata properties with their occurrence frequencies (as of July 19, 2021) and an example.
! Property !! Frequency !! Example
|-
| ‘defining formula’ (P2534) || 5166 || E = m c^2
|-
| ‘in defining formula’ (P7235) || 703 || E
|-
| ‘calculated from’ (P4934) || 780 || mass
|-
| ‘has part’ (P527) || 179 || energy
|}

Table 1 shows the four most relevant and most frequently used properties for mathematical concept items. While the ‘defining formula’ property is employed to store an entire formula (e.g., E=mc^2 in Q35875), ‘in defining formula’ (P7235), ‘calculated from’ (P4934), and ‘has part’ (P527) are used to denote the identifier information.

The occurrence frequencies were obtained by running Wikidata SPARQL queries<sup>7</sup>. For example, the number of items with the ‘defining formula’ property can be obtained from the result of the following query snippet:

<pre>
# Retrieve all items with 'defining formula' property P2534
SELECT ?formula WHERE {
  ?item wdt:P2534 ?formula .
}
</pre>

Figure 1 illustrates, using the ‘mass-energy equivalence’ item (Q35875) as an example, how the defining formula is displayed in the Wikidata user interface.

Fig. 1. Displaying the ‘defining formula’ property of the item ‘mass-energy equivalence’. This excerpt from the Wikidata item page shows where the LaTeX formula string can be inserted.

As of July 30, 2021, the usage frequency distribution is as shown in Table 1. The ‘has part’ property, which was historically used first, has gradually been replaced by ‘calculated from’, which is now more than four times as frequent (780 vs. 179 occurrences). For a discussion of the differences between the two properties and their individual limitations, see Section 3.3.

In this resource paper, we discuss how the mathematical knowledge stored in Wikidata can be extended and employed for Mathematical Entity Linking (MathEL) and its applications, e.g., Mathematical Question Answering (MathQA) and Document Classification Explainability (DCE).

The remainder of this paper is structured as follows. In Section 2, we describe how the knowledge can be distilled by annotating mathematical documents (papers, articles, etc.). We show how this can be accelerated using an annotation recommender system. In Section 3, we present standards and systems for benchmarking the knowledge for Mathematical Information Retrieval (MathIR) experiments. Mathematical Entity Retrieval and Linking methods are introduced, and community feedback on incorporating MathEL data into Wikidata and Wikipedia is discussed. Section 4 outlines MathQA and DCE as two example applications of MathEL and concludes with an outlook on challenges and future work.

<sup>4</sup> https://www.wikidata.org/w/index.php?title=Property:P2534&oldid=303933381

<sup>5</sup> https://www.wikidata.org/w/index.php?title=Q35875&oldid=303968820

<sup>6</sup> https://www.wikidata.org/w/index.php?title=Property:P4934&oldid=646697942

<sup>7</sup> https://query.wikidata.org

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work was supported by DFG grant GI-1259-1.
===2 Mathematical Entity Annotation===
The process of Mathematical Entity Linking can be divided into 1) Mathematical Entity Annotation and 2) Mathematical Entity Retrieval. In this section, we start with 1) by presenting approaches for document annotation and its acceleration by annotation recommendation.

====2.1 Document Annotation====
Document annotations are generally employed to provide additional information about a resource (e.g., comments) or to link resources (e.g., to URLs). The Web Annotation Data Model<sup>8</sup> specifies the annotation model structure (id, type, property, relationship) in JSON format. Moreover, RDF classes and ontologies should be defined and serialized according to the Web Annotation Vocabulary<sup>9</sup>.

Several annotation tools and recommender systems for linked data have been developed so far. Tietz et al. present a system for WordPress [24] that recommends DBpedia resources and visualizes the annotation process. Users can explore background information and relationships between named entities. Vagliano et al. provide a technical report [25] on semantic annotation of user reviews using DBpedia and Wikidata. Purwitasari et al. introduce an ontology-based annotation recommender for learning material [10] using Latent Semantic Analysis (LSA) and WordNet to determine the context of content categories, which are then structured into an ontology model. Lastly, Wiesing et al. developed an RDF annotation tool (KAT) specific to STEM documents in XHTML format [26].

<sup>8</sup> https://www.w3.org/TR/annotation-model

<sup>9</sup> https://www.w3.org/TR/annotation-vocab

====2.2 Annotation Recommendation====
To disambiguate and match mathematical expressions in Wikipedia articles to Wikidata items [12], the ‘AnnoMathTeX’ formula and identifier annotation recommender system<sup>10</sup> was developed. The system is designed to suggest Wikidata item name and QID candidates provided from several sources, such as the arXiv<sup>11</sup>, Wikipedia, Wikidata, or the text that surrounds the formula.
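To connect the recommender with the Web Annotation Data Model from Section 2.1, the following minimal sketch shows how one accepted recommendation could be serialized as a Web Annotation. It is an illustration only and not the storage format used by AnnoMathTeX: the function name <code>make_formula_annotation</code>, the annotation IRI under example.org, and the choice of a <code>TextQuoteSelector</code> carrying the LaTeX string are assumptions made for this example.

```python
import json


def make_formula_annotation(annotation_id, wikidata_qid, article_url, formula_tex):
    """Build a Web Annotation (JSON-LD) linking a formula occurrence to a Wikidata item.

    Hypothetical helper for illustration; not part of AnnoMathTeX.
    """
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "motivation": "identifying",
        # Body: the Wikidata entity the formula is linked to.
        "body": f"http://www.wikidata.org/entity/{wikidata_qid}",
        # Target: the document plus the quoted formula string inside it.
        "target": {
            "source": article_url,
            "selector": {"type": "TextQuoteSelector", "exact": formula_tex},
        },
    }


annotation = make_formula_annotation(
    annotation_id="https://example.org/annotations/1",  # hypothetical IRI
    wikidata_qid="Q35875",  # 'mass-energy equivalence'
    article_url="https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalence",
    formula_tex="E = mc^2",
)
print(json.dumps(annotation, indent=2))
```

Keeping the Wikidata entity as the annotation body and the formula string as a selector on the target keeps the link independent of how the article renders the formula.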
In the first evaluation, 78% of the identifier name recommendations were accepted by the user. In additional experiments, the community acceptance of the Wikipedia article link and Wikidata item seed edits was assessed [15]. For 88% of the edited Wikipedia articles and 67% of the Wikidata items, the contributions were accepted. Moreover, annotation was accelerated by a factor of 1.4 for formulae and 2.4 for identifiers. The ‘AnnoMathTeX’ system is ready to be integrated seamlessly into the Wikimedia user interfaces via a ‘MathWikiLink’ API.

We presented the system with its applications at the Wiki Workshop 2021 (WWW 2021 conference) [15]. Figure 2 shows the user interface of the ‘AnnoMathTeX’ system at the start, where Wikipedia Wikitext or arXiv LaTeX articles can be selected, loaded, and deleted.

Fig. 2. The start screen of AnnoMathTeX, where the user can start or continue annotating selected Wikipedia articles.

If the user clicks on a formula or identifier in the loaded document, recommendations are displayed as shown in Figure 3 for the example formula F = m·a, which is seeded into Wikidata as the item ‘Newton’s second law of motion’ (Q3268014).

Fig. 3. Popup table containing recommendations for the annotation of the formula ‘F = m · a’, provided from different sources (cut off after the third-ranked entry).

Figure 4 shows an example where the formula name recommendation is very specific within a concept hierarchy.

Fig. 4. Example popup table providing a very specific applicable recommendation, which is selected (highlighted in red).

<sup>10</sup> https://annomathtex.wmflabs.org

<sup>11</sup> https://arxiv.org

====2.3 Annotation Guidelines and Issues====
The purpose of the first testing phase of the system was to work out how the mathematical knowledge contained in Wikipedia articles can be transferred to Wikidata statements. For the annotation, we developed the following guidelines:
* Annotate identifiers first, such that the formula name recommendation retrieval from Wikidata via the ‘has part’ properties is enabled;
* Do not annotate identifiers that describe objects, such as ‘gas’, ‘solid’, or ‘line’, rather than quantities or constants;
* Ignore derivative d characters, such as in d/dt, and all indices (superscript or subscript);
* Locally different meanings of the same identifier within an article should be avoided (appeal to editors);
* Ignore block-level formulae that are not relations (equations, inequations, etc.) or do not have a single identifier right-hand side, e.g., <math>\sum_i I_i = \sum_i r_i^2 m_i</math>, <math>0 = \ldots</math>, <math>dE = \ldots</math>. Also, ignore formulae in tables and derivations;
* Proper names (e.g., ‘Planck constant’) must be capitalized according to the conventions from ‘Content dictionary description’ (DRMF) [3].

During the annotation process, we discovered the following issues:

* It is not possible to parse equations with no spaces between identifiers, e.g., in the right-hand side of the LaTeX string ‘L = rmv’;
* There are different common practices to denote vectors in LaTeX, e.g., \vec vs. \mathbf;
* There are different common practices for properties in Wikidata to include the semantics of the formula elements or identifiers, e.g., ‘has part’ (P527) vs. ‘calculated from’ (P4934); see the discussion in Section 3.3;
* Sometimes two names are both commonly used to denote the same Formula Concept, e.g., ‘M-sigma relation’ (Q3424023) and ‘Faber–Jackson relation’ (Q1390162);
* When the Wikidata item for a Formula Concept was missing and we had to create it, we needed to reinsert the new QID into the annotations table manually.

In the future, the process of discovering new issues and requirements to improve the system and extend the annotation guidelines will be continued. Wikimedia users can collaboratively contribute to this joint endeavor.
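Two of the issues listed above, run-together identifiers such as ‘rmv’ and the competing vector notations \vec and \mathbf, lend themselves to simple preprocessing. The sketch below is one possible workaround, not a feature of AnnoMathTeX: the helper names are hypothetical, and it assumes that a set of already annotated identifiers is available to guide the split.

```python
import re


def normalize_vectors(latex):
    """Rewrite \\mathbf{x} as \\vec{x} so that both notations compare equal."""
    return re.sub(r"\\mathbf\s*\{([^{}]*)\}", r"\\vec{\1}", latex)


def split_identifiers(run, known):
    """Greedily split a run such as 'rmv' into the longest known identifiers.

    `known` is an assumed vocabulary of identifiers annotated so far;
    characters that match nothing are kept as single-letter identifiers.
    """
    tokens, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):  # try the longest candidate first
            if run[i:j] in known:
                tokens.append(run[i:j])
                i = j
                break
        else:
            tokens.append(run[i])
            i += 1
    return tokens


normalized = normalize_vectors(r"\mathbf{F} = m \mathbf{a}")
identifiers = split_identifiers("rmv", {"r", "m", "v"})
print(normalized)
print(identifiers)
```

A greedy longest match is ambiguous whenever multi-letter identifiers overlap with single-letter ones, so in practice a human check of the proposed split would still be needed.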
===3 Mathematical Entity Linking===
====3.1 Mathematical Entity Benchmarking====
The open-source and open-access formula benchmark system MathMLben<sup>12</sup> was introduced to facilitate the conversion between different mathematical formats such as LaTeX variations and Computer Algebra Systems (CAS) [19]. Figure 5 shows the Graphical User Interface (GUI) of the system, displaying the expression tree of an example formula. Each formula identifier can be annotated with Wikidata QID macros. The annotation functionality was motivated by the potential to define semantic relatedness for formulae by counting Wikidata links between them [14].

The MathMLben database contains 375 expressions or formulae (GoldIDs) from Wikipedia, the arXiv, and the Digital Library of Mathematical Functions (DLMF). The content ranges from individual symbols to complex multi-line formulae. It additionally contains meta-information, such as the source URL or document page it was retrieved from. Expressions 1 to 100 are random samples taken from the National Institute of Informatics Testbeds and Community for Information access Research Project (NTCIR) 11/12 Math Wikipedia Task [1]. Expressions 101 to 200 are random samples taken from the NIST Digital Library of Mathematical Functions (DLMF) [6], available on the website https://dlmf.nist.gov, which contains around 10,000 labeled LaTeX formulae with semantic markup classified in 36 categories [2, 4]. Expressions 201 to 305 were selected from the NTCIR arXiv and NTCIR-12 Wikipedia retrieval datasets; 70% of these formulae were taken from the arXiv and 30% from a Wikipedia dump. The remaining formulae were extracted from an annotation of 25 selected Wikipedia articles from physics (classical mechanics) [15].

<sup>12</sup> https://mathmlben.wmflabs.org
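The idea of deriving semantic relatedness from Wikidata links, which motivated the annotation functionality, can be illustrated with a small sketch. It substitutes a plain Jaccard overlap of the identifier QIDs for the link counting of [14], so it is a simplification rather than the measure used there. The function name is hypothetical, and the QIDs for mass, force, and acceleration are assumptions quoted for illustration; only Q11379 (‘energy’) and Q2111 (‘speed of light’) appear in this paper.

```python
def qid_relatedness(qids_a, qids_b):
    """Jaccard overlap of two collections of Wikidata QIDs (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(qids_a), set(qids_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Identifier annotations of two formulae.
# Q11423 (mass), Q11402 (force), Q11376 (acceleration) are assumed QIDs.
mass_energy_equivalence = ["Q11379", "Q11423", "Q2111"]   # E = m c^2
newtons_second_law = ["Q11402", "Q11423", "Q11376"]       # F = m a

score = qid_relatedness(mass_energy_equivalence, newtons_second_law)
print(score)  # one shared identifier (mass) out of five distinct ones
```

A set-based measure ignores how often and where an identifier occurs in the expression tree, which is one reason to treat it as a first approximation only.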
For each GoldID entry or formula, there is an input field for the Formula Name, Formula Type (definition, equation, relation, or general formula), Original Input TeX, and manually Corrected TeX, together with a Hyperlink to the source. The Semantic LaTeX Input field is used for the semantic annotations, as a grounding for the generation of Content MathML with Wikidata annotations by LaTeXML [9, 5]. The corrected TeX is rendered in real time by Mathoid [23]. Moreover, an expression tree is displayed, rendered by our visualization tool VMEXT [20]. For each symbol in the tree, the assigned annotation is shown as a yellow mouse-over infobox containing the Wikidata QID, name, and description (if available). The system includes a user guide on how to access raw data or contribute by extending or correcting the expression tree or (Wikidata) annotations.

====3.2 Formula Concept Seeding and Retrieval====
In 2018, we first introduced linking mathematical formula content to Wikidata, both in MathML and LaTeX markup [19, 14]. In 2019, we issued a call for a ‘Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR) challenge’ to develop automated Mathematical Entity Linking. With our FCD approach, we achieved a recall of 68% for retrieving equivalent representations of frequent formulae and 72% for extracting the formula name (assigned to a Wikidata item) from the surrounding text on the NTCIR arXiv dataset [1].

We defined a ‘Formula Concept’ as a ‘labeled collection of mathematical formulae that are equivalent but have different representations through notation, e.g., the use of different identifier symbols or commutations’ [13]. For example, the formula <math>E = mc^2</math> can be regarded as one representation of the Formula Concept labeled ‘mass-energy equivalence’. A different representation of this same concept with different notation and rearrangement could be <math>\mu = \epsilon/c^2</math>.
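The step from one representation of a Formula Concept to another can be written out explicitly. The following worked example is a sketch under the assumption that <math>\mu</math> denotes the mass and <math>\epsilon</math> the energy in the alternative notation; it requires the amsmath package.

```latex
\begin{align*}
E   &= m c^{2}
    && \text{(representation stored in Q35875)} \\
m   &= \frac{E}{c^{2}}
    && \text{(rearranged: both sides divided by } c^{2}\text{)} \\
\mu &= \frac{\epsilon}{c^{2}}
    && \text{(renamed: } m \to \mu,\ E \to \epsilon\text{)}
\end{align*}
```

All three lines state the same relation between mass, energy, and the speed of light, so a retrieval method has to treat both the rearrangement and the renaming as equivalence-preserving.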
Fig. 5. Graphical User Interface of MathMLben providing several TeX input fields (left) and a mathematical expression tree rendered by the VMEXT visualization tool (right) [19].

The following snippet exemplifies how Einstein’s famous formula <math>E = mc^2</math>, stored in the item ‘mass-energy equivalence’ (Q35875), can be found via a SPARQL query. Based on the snippet, a formula search engine on Wikidata can be implemented.

<pre>
# Retrieve all items with label, description, and formula whose
# 'defining formula' property (P2534) contains the string 'E=mc^2'
SELECT ?item ?itemLabel ?itemDescription ?definingFormula WHERE {
  ?item wdt:P2534 ?definingFormula .
  FILTER(CONTAINS(STR(?definingFormula), 'E=mc^2'))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
</pre>

Figure 6 illustrates how to make use of the ‘has part’ (P527) property to get all items with a formula whose identifiers are annotated as ‘energy’ (Q11379) and ‘speed of light’ (Q2111). Based on the snippet, a semantic formula search engine on Wikidata can be implemented.

<pre>
SELECT ?item ?itemLabel ?itemDescription WHERE {
  ?item wdt:P527 wd:Q11379 .
  ?item wdt:P527 wd:Q2111 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
</pre>

Fig. 6. SPARQL query making use of the ‘has part’ property. It returns all items that are connected to ‘energy’ (Q11379) and ‘speed of light’ (Q2111) through the ‘has part’ property (P527).

====3.3 Community Feedback on Wikidata and Wikipedia====
In [15] we presented the evaluation of our AnnoMathTeX formula and identifier annotation recommender system on a selection of 25 Wikipedia articles from physics. The linked formula concepts were seeded into Wikidata and persisted in our formula benchmark system MathMLben (see Section 3.1). The formula linkings from the annotated Wikipedia articles were included in the Wikitext via the qid attribute of the