<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating OpenMath Content Dictionaries from Wikidata</article-title>
      </title-group>
      <contrib-group>
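        <contrib contrib-type="author">
          <string-name>Moritz Schubotz</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer and Information Science, University of Konstanz</institution>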
          ,
          <addr-line>Box 76, 78464 Konstanz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>OpenMath content dictionaries are collections of mathematical symbols. Traditionally, content dictionaries are handcrafted by experts. The OpenMath specification requires a name and a textual description in English for each symbol in a dictionary. In our recently published MathML benchmark (MathMLBen), we represent mathematical formulae in Content MathML referring to Wikidata as the knowledge base for the grounding of the semantics. Based on this benchmark, we present an OpenMath content dictionary, which we generated automatically from Wikidata. Our Wikidata content dictionary consists of 330 entries. We used the 280 entries of the benchmark MathMLBen, as well as 50 entries that correspond to already existing items in the official OpenMath content dictionaries. To create these items, we proposed the Wikidata property P5610. With this property, everyone can link OpenMath symbols and Wikidata items. By linking Wikidata and OpenMath data, the multilingual, community-maintained textual descriptions, references to Wikipedia articles, and external links to other knowledge bases (such as the Wolfram Functions Site) are connected to the expert-crafted OpenMath content dictionaries. Ultimately, these connections form a new content dictionary base. This provides multilingual background information for symbols in MathML formulae.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>3. it is difficult to identify the reasons for the limited performance of a system.</p>
      <p>
        As an alternative, we suggest breaking down complex MathIR tasks into several subtasks. The first subtask is
to convert formulae from their source representation to a machine-readable format which describes the semantics.
With our MathMLBen project [11, 13], we created an open gold standard for this task. MathMLBen provides a
list of over 300 formulae and their textual contexts. The example formulae include formulae that were used in
the NTCIR search tasks [1, 2, 3], as well as formulae from the DLMF [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ] and DRMF [4]. For all formulae, the
original LaTeX representation, a corrected LaTeX form (that corrects typographic errors in the layout), and a
semantic LaTeX interpretation are given. From the semantic LaTeX representation, we generate parallel content
and presentation MathML markup using LaTeXML [8]. We consider the generated MathML as a first version
of a machine-readable format which describes the semantics and use it to measure the effectiveness of different
systems for the task described above [13]. However, this machine-readable format is not yet compatible with
the OpenMath standard. In Section 2, we describe how we generated a first version of the Wikidata content
dictionary wikidata.ocd, which contains all symbols occurring in our gold standard. However, not all symbols of
our gold standard were associated with Wikidata items; some were associated with standard OpenMath symbols instead. In Section 3,
we experiment with exchanging those standard symbols with Wikidata items and analyse the effects. Finally,
Section 4 concludes the paper and points out future work.
      </p>
    </sec>
    <sec id="sec-1b">
      <title>The first version of a Wikidata content dictionary</title>
      <p>As introduced earlier, a key feature of MathMLBen is its special LaTeX(ML) macros [11]. Those macros link
csymbol elements in the MathML to entries in Wikidata. For example, one element of our gold standard
includes the logistic function <inline-formula><tex-math>f(x) = \frac{L}{1 + e^{-k(x - x_0)}}</tex-math></inline-formula>. We did not find a symbol for the logistic function in the
OpenMath content dictionaries<fn id="fn1"><label>1</label><p>A list of all OpenMath symbols is available from https://www.openmath.org/symbols/.</p></fn>. However, articles regarding the logistic function can be found in many different
languages in Wikipedia. Moreover, the Wikidata item Q1052379 connects all these Wikipedia articles. This
item was manually improved by 19 users (excluding bots). Through these improvements, additional data such as
properties, relations to other items, and external identifiers were added. Figure 1 shows a screenshot of the item,
including statements, external identifiers, and links to Wikipedia articles as well as other Wiki projects such
as Wikisource and Wikiversity. In our gold standard, we used the semantic LaTeX macro<fn id="fn2"><label>2</label><p>The difference between the macros \w and \wf is that \wf associates the role function with a symbol. For example, the first
invisible operator in f(a + b) is interpreted as function application rather than multiplication if \wf is used [11].</p></fn> \wf{Q1052379}{f}
to encode that LaTeXML should treat the symbol f as &lt;csymbol cd="wikidata"&gt;Q1052379&lt;/csymbol&gt; [11].
However, the content dictionary wikidata that the csymbol element refers to with its cd attribute did not exist.</p>
      <p>Currently, there are more than 49 million items in Wikidata. Not all items are part of mathematical formulae.
Thus, creating a Wikidata content dictionary based on all items would be impractical. In this paper, we present a
Wikidata content dictionary that includes all the 280 entries used in MathMLBen. We call our content dictionary
wikidata.ocd. It can be downloaded from https://cd.formulasearchengine.com/wikidata.ocd.</p>
      <p>Listing 1 shows the content dictionary entry for the symbol logistic function (Q1052379). All 280 symbols
follow the same pattern because we generated them with MathTools [6] using the Wikidata item
numbers given in the csymbol elements in MathMLBen. The description includes the English label of the
Wikidata entry (line 1366), the English description (not available in Listing 1), a link to the English Wikipedia
article (line 1367), and finally a static link to the version of the Wikidata item that was used to create the content
dictionary entry (line 1370). While this description is currently in English, the language is only a configuration
parameter in our tool. Theoretically, any other language could be used. However, the success depends on the
number of available community-maintained texts in that language (cf. Section 3).</p>
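      <p>The generation pattern described above can be sketched in a few lines of Python. This is a minimal illustration and not the MathTools implementation: the element names CDDefinition, Name, and Description come from the OpenMath content dictionary format, whereas the function name build_cd_definition, the layout of the description text, and the revision URL (a placeholder) are assumptions made for this example.</p>
      <preformat>
```python
import xml.etree.ElementTree as ET


def build_cd_definition(qid, label, description, wikipedia_url, revision_url):
    """Build one CDDefinition element for a Wikidata item (illustrative layout)."""
    definition = ET.Element("CDDefinition")
    ET.SubElement(definition, "Name").text = qid
    # Collect the text parts; the Wikidata description may be missing.
    parts = [label]
    if description:
        parts.append(description)
    parts.append("Wikipedia: " + wikipedia_url)
    parts.append("Source: " + revision_url)
    ET.SubElement(definition, "Description").text = "\n".join(parts)
    return definition


entry = build_cd_definition(
    qid="Q1052379",
    label="logistic function",
    description=None,  # no English description is available for this item in Listing 1
    wikipedia_url="https://en.wikipedia.org/wiki/Logistic_function",
    # Placeholder revision id; a real entry would point to the revision actually used.
    revision_url="https://www.wikidata.org/wiki/Q1052379?oldid=0",
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```
      </preformat>
      <p>Because the label and description are passed in as plain strings, switching the language of the generated entry only requires fetching those strings in another language, which mirrors the configuration parameter mentioned above.</p>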
      <p>Our content dictionary wikidata.ocd improves the standard compliance of the MathMLBen gold standard.
It provides a content dictionary for third-party MathML processing software that can read user-contributed
content dictionaries without requiring a special implementation to fetch data from Wikidata.</p>
    </sec>
    <sec id="sec-2">
      <title>Using Wikidata as cdbase</title>
      <p>Having discussed how to automatically derive a content dictionary from a set of Wikidata items, we now discuss
how to create a cdbase from Wikidata that contains the standard OpenMath symbols.</p>
      <p>As Figure 2 shows, the nature of content dictionaries and Wikipedia pages (here in English) is different.
While the CD description is brief in human-readable content, the Wikipedia page shows a lot of human-readable
information. On the other hand, there are formal mathematical properties that are hard to extract from the
Wikipedia page. Consequently, we analyse the strengths and weaknesses of both approaches in the following.
However, before doing so, we need to map entries in Wikidata to OpenMath. We therefore proposed a new
property in Wikidata (P5610) of type external identifier, which was approved on August 9th, 2018. This identifier,
labeled OpenMath ID, allows one to refer to OpenMath symbols from within Wikidata. That way, we connect
both communities, Wikidata and OpenMath. Everyone (even without a Wikidata account) is now able to create
new mappings.</p>
      <p>
        The new property P5610 is of type string and has three constraints: a single-value constraint, a unique-value constraint, and a
regexp filter constraint, ([a-z]+[0-9]*)\#([a-z_]+). These constraints prevent common mistakes. Table 1 lists
the 50 standard OpenMath symbols that occur in the MathMLBen project. We uploaded these manually created
mappings to Wikidata on August 10th, 2018. Consequently, we now have the opportunity to describe all symbols
that occur in the gold standard in terms of Wikidata without referring to any OpenMath definitions. One can
now create a redirect service which redirects traditional MathML and OpenMath IDs such as &lt;plus&gt;, which
corresponds to arith1#plus, to the associated Wikidata entry (e.g., Q32043 for plus). According to the MathML
standard (Section 4.2.3.2), the URI of a definition can be given as URI = cdbase + '/' + cd-name + '#' +
symbol-name. The SPARQL interface query.wikidata.org allows one to find an item that is associated with the
last part of the URL (cd-name#symbol-name). For instance, the query for arith1#plus reads SELECT ?x WHERE
{ ?x wdt:P5610 'arith1#plus'. }, which returns the URL http://www.wikidata.org/entity/Q32043, the
Wikidata item for plus. To materialise the results, we used the method described in Section 2 to generate
CDDefinition elements for the 50 standard symbols.
      </p>
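      <p>The lookup steps of such a redirect service can be sketched as follows. This is a minimal sketch under stated assumptions, not a released component: the function names are hypothetical, the regular expression is the format constraint quoted above (written without the escaping backslash), and http://www.openmath.org/cd is used as the cdbase of the official dictionaries. The sketch only builds the query string; sending it to query.wikidata.org is left out.</p>
      <preformat>
```python
import re

# Format constraint of P5610 as quoted in the text: cd-name, '#', symbol-name.
OPENMATH_ID = re.compile(r"([a-z]+[0-9]*)#([a-z_]+)")


def split_openmath_id(openmath_id):
    """Return (cd_name, symbol_name) or raise ValueError for malformed IDs."""
    match = OPENMATH_ID.fullmatch(openmath_id)
    if match is None:
        raise ValueError("not a valid OpenMath ID: " + repr(openmath_id))
    return match.group(1), match.group(2)


def definition_uri(cdbase, cd_name, symbol_name):
    """URI = cdbase + '/' + cd-name + '#' + symbol-name."""
    return cdbase + "/" + cd_name + "#" + symbol_name


def sparql_for(openmath_id):
    """Build the SPARQL query that finds the Wikidata item for an OpenMath ID."""
    split_openmath_id(openmath_id)  # reject malformed IDs before building a query
    return "SELECT ?x WHERE { ?x wdt:P5610 '" + openmath_id + "'. }"


cd_name, symbol_name = split_openmath_id("arith1#plus")
print(definition_uri("http://www.openmath.org/cd", cd_name, symbol_name))
print(sparql_for("arith1#plus"))
```
      </preformat>
      <p>Validating the identifier before it is placed into the query keeps quotes and other unexpected characters out of the SPARQL string, which is one practical benefit of the format constraint.</p>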
      <p>Listing 2 shows an entry in wikidata.ocd that corresponds to the Wikidata item plus. The description section
contains more information than in Listing 1. Line 269 is the English description from the item that represents
addition in Wikidata. Moreover, line 272 links to the definition of arith1#plus from the OpenMath content
dictionary.</p>
      <p>For the remainder of the section, we discuss the differences between traditional OpenMath symbol definition
entries and Wikidata-generated symbol definitions. Our Wikidata content dictionary contains 330 symbols
in a single content dictionary. In contrast, there are 289 official OpenMath symbols divided into 38 content
dictionaries. 247 OpenMath symbols have a role attribute (application 193, constant 39, binder 3,
semantic-attribution 2, error 3). While we did research on identifying Wikidata items as numerical constants [14], this
information is not included in the current version of wikidata.ocd. Moreover, the official OpenMath content
dictionaries contain 149 examples, 180 formal mathematical properties (FMP), and 179 commented mathematical
properties (CMP). Currently, wikidata.ocd does not contain any of the aforementioned features. Due to
lack of time, the MathMLBen data has not been converted to the OpenMath XML format, which would be
required to create examples. Deriving reasonable CMP or FMP from Wikidata requires semantic enhancement
of the defining formula statements, which are currently only available in presentation form. The description field
in the official OpenMath content dictionaries is on average 131 words long, as compared to 212 words (including 14
words for the reference to the source) in wikidata.ocd. While Wikidata items are not divided into a structure
comparable to content dictionaries, they have hierarchical relations such as the instance of (P31) relation. As
displayed in Table 1, the instance of relation is not modelled consistently. We hypothesise that this is typical
for corpora which emerged from community interactions. Finally, the symbol names in the standard OpenMath
dictionaries are easier to remember for English speakers, whereas the long numeric item identifiers of Wikidata
are hard to read and to remember. Therefore, using an IDE or a smart editor is a prerequisite
for working with Wikidata items conveniently. To support this purpose, we released the node module codemirror-wikidata [12], which provides
autocompletion based on the description rather than on the numeric values.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Works</title>
      <p>In this paper, we released a first version of the Wikidata content dictionary wikidata.ocd to the public. It
contains all the symbols used in the MathMLBen open gold standard. Moreover, it contains 50 entities that
correspond to standard OpenMath symbols. Furthermore, we introduced the new Wikidata property P5610
and described how it can be used to create an alternative cdbase. Also, we compared symbol descriptions
that were generated automatically from Wikidata to the manually crafted OpenMath symbol definitions. While
multilingualism and links to Wikipedia might be considered an advantage of the Wikidata cdbase, many other
formal aspects such as structure and type information are currently better modeled in the traditional OpenMath
content dictionaries. On the other hand, Wikidata has far more items that could possibly be used as symbol
definitions.</p>
      <p>Future research should investigate how the missing formal information in wikidata.ocd can be automatically
extracted. A mechanism that generates content dictionaries from Wikidata with the same formal
quality as the current OpenMath content dictionaries would provide a good foundation for CD extension based on
Wikidata and would ease the expansion of the OpenMath standard. Another promising research direction is to better
understand how information given in distributed data sources can be connected using alignments [7, 9, 10].</p>
      <p>Acknowledgments</p>
      <p>We thank the Wikimedia Foundation and Wikimedia Deutschland for providing cloud computing facilities and
office space. This work was supported by the FITWeltweit program of the German Academic
Exchange Service (DAAD) as well as the German Research Foundation (DFG grant GI-1259-1). The author
would like to thank Howard Cohl for constructive criticism of the manuscript.</p>
      <p>[3] Akiko Aizawa et al. "NTCIR-12 Math-3 Task Overview". In: NTCIR. National Institute of Informatics
(NII), 2016.</p>
      <p>[4] Howard S. Cohl et al. "Growing the Digital Repository of Mathematical Formulae with Generic LaTeX
Sources". In: Proc. CICM. Ed. by Manfred Kerber et al. Vol. 9150. Springer, 2015.</p>
      <p>[6] André Greiner-Petter et al. "MathTools: An Open API for Convenient MathML Handling". In: 11th
Conference on Intelligent Computer Mathematics CICM, RISC, Hagenberg, Austria. RISC, Hagenberg, Austria,
Aug. 2018.</p>
      <p>[7] Cezary Kaliszyk et al. "A Standard for Aligning Mathematical Concepts". In: Joint Proceedings of the
FM4M, MathUI, and ThEdu Workshops, Doctoral Program, and Work in Progress at the Conference
on Intelligent Computer Mathematics 2016 co-located with the 9th Conference on Intelligent Computer
Mathematics (CICM 2016), Bialystok, Poland, July 25-29, 2016. Ed. by Andrea Kohlhase et al. Vol. 1785.
CEUR-WS.org, 2016.</p>
      <p>[8] Bruce Miller. LaTeXML: A LaTeX to XML/HTML/MathML Converter. Web Manual at http://dlmf.nist.gov/LaTeXML/.
Seen 2018.</p>
      <p>[9] Dennis Müller et al. "Alignment-based Translations Across Formal Systems using Interface Theories". In:
Proceedings of the Fifth Workshop on Proof eXchange for Theorem Proving, PxTP 2017, Brasilia, Brazil,
23-24 September 2017. Ed. by Catherine Dubois and Bruno Woltzenlogel Paleo. Vol. 262. 2017.</p>
      <p>[10] Dennis Müller et al. "Classification of Alignments Between Concepts of Formal Mathematical Systems".
In: Proc. CICM. Ed. by Herman Geuvers et al. Vol. 10383. Springer, 2017.</p>
      <p>[11] Philipp Scharpf, Moritz Schubotz, and Bela Gipp. "Representing Mathematical Formulae in Content
MathML using Wikidata". In: Proceedings of the 3rd Joint Workshop on Bibliometric-enhanced
Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2018) co-located with
the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR 2018), Ann Arbor, USA, July 12, 2018. Ed. by Philipp Mayr, Muthu Kumar Chandrasekaran, and
Kokil Jaidka. Vol. 2132. CEUR-WS.org, 2018.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>OpenMath ID</th><th>Instance of</th></tr>
          </thead>
          <tbody>
            <tr><td>arith1#abs</td><td>piecewise function, even function, idempotent function</td></tr>
            <tr><td>arith1#divide</td><td>binary operation</td></tr>
            <tr><td>arith1#gcd</td><td>function</td></tr>
            <tr><td>arith1#minus</td><td>binary operation, operation</td></tr>
            <tr><td>arith1#plus</td><td>binary operation</td></tr>
            <tr><td>arith1#power</td><td>operation</td></tr>
            <tr><td>arith1#root</td><td>type of mathematical function, algebraic function</td></tr>
            <tr><td>arith1#sum</td><td>mathematical expression</td></tr>
            <tr><td>arith1#times</td><td>binary operation</td></tr>
            <tr><td>arith1#unary_minus</td><td></td></tr>
            <tr><td>calculus1#diff</td><td>unary operation, mathematical concept</td></tr>
            <tr><td>fns1#lambda</td><td>Wikimedia disambiguation page</td></tr>
            <tr><td>fns1#left_compose</td><td>operator, operation</td></tr>
            <tr><td>hypergeo0#gamma</td><td>function</td></tr>
            <tr><td>integer1#factorial</td><td>function</td></tr>
            <tr><td>interval1#interval_oo</td><td>part</td></tr>
            <tr><td>limit1#limit</td><td>mathematical concept</td></tr>
            <tr><td>limit1#null</td><td>integer, Fibonacci number, triangular number, automorphic number, even number, non-negative integer, 0 number class, non-positive integer</td></tr>
            <tr><td>linalg1#determinant</td><td>invariant</td></tr>
            <tr><td>linalg2#matrix</td><td>array data structure, tensor</td></tr>
            <tr><td>linalg2#matrixrow</td><td>row and column vectors</td></tr>
            <tr><td>linalg2#vector</td><td>tensor</td></tr>
            <tr><td>list1#list</td><td>creative work</td></tr>
            <tr><td>logic1#and</td><td>logical connective, boolean function</td></tr>
            <tr><td>logic1#equivalent</td><td>transitive relation, symmetric relation, reflexive relation</td></tr>
            <tr><td>nums1#e</td><td>real number, transcendental number, irrational number</td></tr>
            <tr><td>nums1#i</td><td>square root, mathematical constant, Gaussian integer, imaginary number</td></tr>
            <tr><td>nums1#infinity</td><td>mathematical concept</td></tr>
            <tr><td>nums1#pi</td><td>real number, transcendental number, mathematical constant, irrational number</td></tr>
            <tr><td>relation1#approx</td><td>relation, estimation</td></tr>
            <tr><td>relation1#eq</td><td>equivalence relation, partial order</td></tr>
            <tr><td>relation1#geq</td><td>inequation</td></tr>
            <tr><td>relation1#gt</td><td>inequation</td></tr>
            <tr><td>relation1#leq</td><td>inequation</td></tr>
            <tr><td>relation1#lt</td><td>inequation</td></tr>
            <tr><td>relation1#neq</td><td>inequality sign</td></tr>
            <tr><td>set1#in</td><td>binary relation, subclass</td></tr>
            <tr><td>set1#intersect</td><td>binary operation, set operation</td></tr>
            <tr><td>set1#set</td><td>Wikidata metaclass, Wikidata metaclass, Wikidata metaclass, class (set theory), formalization, collection</td></tr>
            <tr><td>transc1#arccos</td><td>inverse trigonometric function, decreasing function</td></tr>
            <tr><td>transc1#arctan</td><td>inverse trigonometric function, increasing function</td></tr>
            <tr><td>transc1#cos</td><td>trigonometric function, even function</td></tr>
            <tr><td>transc1#cosh</td><td>hyperbolic function, even function</td></tr>
            <tr><td>transc1#exp</td><td>exponential function, type of mathematical function</td></tr>
            <tr><td>transc1#ln</td><td>type of mathematical function, logarithm</td></tr>
            <tr><td>transc1#log</td><td>type of mathematical function, type of mathematical function, multivalued function, elementary transcendental function</td></tr>
            <tr><td>transc1#sin</td><td>trigonometric function, odd function</td></tr>
            <tr><td>transc1#sinh</td><td>hyperbolic function, odd function</td></tr>
            <tr><td>transc1#tan</td><td>trigonometric function</td></tr>
            <tr><td>transc1#tanh</td><td>hyperbolic function</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>[12] Moritz Schubotz. "VMEXT2: A Visual Wikidata aware Content MathML Editor". In: Joint Proceedings
of the CME-EI, FMM, CAAT, FVPS, M3SRD, OpenMath Workshops, Doctoral Program and Work in
Progress at the Conference on Intelligent Computer Mathematics 2018 co-located with the 11th Conference
on Intelligent Computer Mathematics (CICM 2018). Ed. by Osman Hasan et al. 2018.</p>
      <p>[13] Moritz Schubotz et al. "Improving the Representation and Conversion of Mathematical Formulae by
Considering their Textual Context". In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital
Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018. Ed. by Jiangping Chen et al. ACM, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Akiko</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          , and Iadh Ounis.
          <article-title>"NTCIR-10 Math Pilot Task Overview"</article-title>
          .
          <source>In: Proceedings of the 10th NTCIR Conference on Evaluation of Information Access Technologies</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Akiko</given-names>
            <surname>Aizawa</surname>
          </string-name>
          et al.
          <article-title>"NTCIR-11 Math-2 Task Overview"</article-title>
          .
          <source>In: Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies</source>
          , NTCIR-11, National Center of Sciences, Tokyo, Japan, December 9-12,
          <year>2014</year>
          . National Institute of Informatics (NII),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.W.J.</given-names>
            <surname>Olver</surname>
          </string-name>
          et al., eds.
          <source>NIST Digital Library of Mathematical Functions</source>
          . http://dlmf.nist.gov/, Release 1.0.14 of 2017-12-21.
          <string-name>
            <given-names>F.W.J.</given-names>
            <surname>Olver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Olde Daalhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.W.</given-names>
            <surname>Lozier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.I.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.F.</given-names>
            <surname>Boisvert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.W.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.R.</given-names>
            <surname>Miller</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.V.</given-names>
            <surname>Saunders</surname>
          </string-name>
          , eds.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>