<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Integrating Mathematics into the Web of Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Jacobs University Bremen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Why Mathematics on the Web of Data?</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Mathematics, a ubiquitous foundation of science, technology, and engineering, is currently missing from the Web of Data. We argue why future internet applications would benefit from mathematical Linked Open Data. Due to the predominance of semantic XML-based markup languages in mathematics - which has good reasons -, contributing mathematical knowledge to the Weeb of Data cannot exclusively rely on RDF. While that is achievable within the current Linked Data architecture, better application-layer support would be desirable. This is a summary of our position; we refer to [26] for full background.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>foundations well represented on the Web of Data would enable the following
applications:
General-Purpose Mathematical Knowledge: Wikipedia states the Pythagorean
theorem as a2 + b2 = c2 (in LATEX) and categorizes it as “Article containing
proofs” and “Mathematical theorem”. A search for the semantically equivalent
expression z = px2 + y2 would fail. From the categorization it is not clear
for a machine whether the article contains a [correct] proof of that theorem.
The same restrictions apply to DBpedia, the Linked Dataset obtained from
Wikipedia [16]. That forced the Polymath collaborators to search for previous
publications of refutations of P 6= NP “proofs” by keyword.</p>
      <p>Statistics: Public sector information, increasingly being published as Linked
Data by governments, has been used to provide, e.g., localized information
retrieval about crime statistics and hospital waiting list statistics [32].
Statistical datasets contain values derived from ground values, or from other
derived values using mathematical functions. As planning data collection and
interpreting collected data requires mathematical models, statistical datasets
need a notion of mathematical semantics [45, 27].</p>
      <p>Publication Databases: The RKB Explorer Linked Dataset [2] classifies ACM’s
scientific publications according to the ACM Computing Classification
System [42]. Still, it is impossible for a Linked Data agent to understand that
a publication merely classified as “F.1.3 Complexity Measures and Classes”
deals with the P and NP complexity classes, and how they are defined.
Enterprise Applications: Linked Data do not have to be open; the
architecture also works in enterprise intranets. Renault has used them for retrieving
information about spare car parts [37]. Now consider decisions to be made
when designing whole cars: They ultimately require mathematical
understanding. An engineer looking for an efficient engine for a projected city car might
feed inputs such as the weight of the car, the average length and duration of
a trip, the most widely available type of fuel and the average environment
temperature when starting the engine into a mathematical model of the
engine in order to predict its fuel consumption under these constraints.
e-Science: The above use case is actually about reproducing an experiment –
one of the key principles of e-Science [5]. Fine-grained reproducibility once
more demands a representation of the mathematical models. Some e-science
datasets include them, e.g. the SysMO SEEK “‘assets catalogue’ describing
data, models, [. . . ], workflows and experiment[s]” [5] from systems biology of
microorganisms [41]. Publishing that as Linked Data is in progress (cf. [5]).
Currently, the mathematical models are given as Content MathML formulae
deeply nested into XML files and thus not directly accessible via URIs.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Representing Mathematical Knowledge on the Web</title>
      <p>Where mathematical knowledge is currently represented machine-comprehensibly,
it is usually done in the native languages of computer algebra systems or proof
assistants, or in Content MathML (the “semantic” sublanguage of MathML
[4]) or OpenMath [11] semantic XML markup languages. Translations from the
native languages of many mathematical systems to these exchange languages
and vice versa are available. Content MathML represents formulae as functional
trees. It has a built-in vocabulary of symbols (operators, functions, etc.) from
high-school and introductory university mathematics and relies on the
OpenMath Content Dictionary (CD) extension mechanism otherwise. OpenMath CDs
are mathematical domain ontologies. A canonical URI format for symbols in
CDs ensures minimum Semantic Web compatibility; e.g., the addition operator
– the plus symbol in the arith1 CD – has the URI http://www.openmath.
org/cd/arith1#plus. The official peer-reviewed CD collection defines 260
symbols from arithmetics, set theory, first-order logic, algebra, calculus, as well
as transcendental and statistical functions [34].1</p>
      <p>Ontologies for representing mathematical knowledge in RDF exist, sometimes
derived from these markup languages [26]. Despite standard URI formats for
mathematical symbols, formulae have always constituted a problem. Their n-ary
ordered tree structure precludes a straightforward RDF representation. Linked
lists or ordered sets – RDF’s built-in collections or sequences [6] or self-made
remakes – are unavoidable. These are, however, badly supported by software and
do not go well along with querying2 and DL reasoning3. Full RDF representations
of formulae can be found in the N3 language supported, e.g., by the cwm [8]
reasoner; however, the syntactic sugar that makes these representations relatively
intuitive is not available in plain RDF. Encodings of Content MathML, using
RDF collections, have been suggested (see, e.g., [36]), but not adopted in practice.</p>
      <p>An alternative – suggested once [10], but never done in practice – is
representing formulae as Content MathML/OpenMath XML literals in RDF while
representing anything else in pure RDF. However, XML literals are largely
opaque to contemporary RDF tools. Virtuoso [33] allows for filtering XML
literals matched by a SPARQL graph pattern by XPath node tests [7]. Corese can
additionally reuse variables from the proper SPARQL part of a query in XPath
expressions [13]. None of these extensions has made it into SPARQL yet.</p>
      <p>RDFa [1] would allow for leaving the representation of n-ary and ordered
structures to the embedding XML. The RDFa 1.1 API [39], once implemented by
browsers, will at least give in-browser scripts similar means of accessing embedded
RDF as the Document Object Model (DOM) does for XML. The XSPARQL [3]
query language combines SPARQL and XQuery; however, queries would still rely
1 The OpenMath CD language with its support for semi-formal descriptions of symbols
and their mathematical properties is sufficiently expressive to allow for simple
computations and to create requirements specifications for implementors of the
above-mentioned translations from and to mathematical systems, but not to support
fully automated inference. The OMDoc language [31, 25] adds the latter layer to
OpenMath and also builds a bridge to LATEX by covering informal text.
2 At least support for querying RDF collections, which some query processors already
support by non-standard extensions, will be standardized in SPARQL 1.1 [22].
3 In OWL one has to avoid RDF collections, as the RDF encoding of OWL uses them
internally to represent n-ary expressions. Self-made linked lists work around that [18].
on a separate service that makes the RDFa annotations available as queryable
RDF. RDFa has not yet been used for representing formulae either. Neither the
MathML nor the OpenMath developers are currently planning to support RDFa.4</p>
      <p>The only approach that has actually been employed in practice is standoff
markup. An RDF graph points to XML fragments and adds information to them,
e.g. additional metadata or links not supported by the respective XML language,
whereas n-ary structures and order are only represented in XML. Most of the
knowledge is represented redundantly in RDF and XML – one of them possibly
automatically generated from the other one – to provide maximum information
to agents that only understand one representation. For example, the HELM
and MONET RDF representations of formulae have focused on symbols in key
positions, e.g. the root of the assumption part of an inference rule, for the purpose
of, e.g., finding applicable theorems, or for matching mathematical problems (e.g.
definite integration) to web services solving them.
4</p>
      <p>Challenges to Publishing Mathematical Linked Data
This section summarizes the ways of publishing mathematical Linked Data that
we have explored so far and the challenges that we have encountered in doing so.</p>
      <p>In [45, 27], we describe a scenario where an agent accesses both RDF datasets
and OpenMath CDs by dereferencing URIs and using content negotiation. The
rules for computing derived values in statistical datasets are represented as RDF
annotations pointing to a function – a symbol from an OpenMath CD – and
other values from the dataset that should be passed as arguments to the function.
An agent that wants to verify a derived value has to construct an OpenMath
formula from this RDF representation and send it to an OpenMath computation
service. For functions called using named arguments, the agent has to consult
the XML representation of the CD to get their order right.</p>
      <p>In a different setting, we have published human-readable XHTML+MathML
documents, where semantic annotations – OpenMath for formulae and RDFa for
the rest – act as anchors for assistive services that provide additional information
on demand or adapt the presentation of a document [15]. This can be combined
with XML/RDF content negotiation. We have realized interactive declaration
lookup for symbols in formulae by dereferencing their canonical URIs (pointing to
a symbol declaration in a CD) from the formula’s annotation and transforming the
OpenMath declarations thus obtained to human-readable Presentation MathML.</p>
      <p>Besides the above-mentioned restrictions in querying XML/RDF combinations,
we have identified three challenges to publishing mathematical Linked Data:
Missing MIME Types: Content negotiation distinguishes representation
formats by MIME type. MathML 3 has officially registered MIME types [4],
whereas an OpenMath MIME type has merely been proposed so far [27].
4 This may be due to the fact that MathML is already at least as expressive as RDFa.</p>
      <p>Besides Content MathML for functional trees, it has a fine-grained general-purpose
annotation mechanism, which has actually existed long before RDFa. OpenMath has
a similar annotation syntax, albeit without URI support.</p>
      <p>Bad Authoring Practices – from a Linked Data point of view – result from
authors of mathematical XML markup using URIs wrongly or not at all.
Hardly any community-contributed OpenMath CDs specifies a base URI or
references symbols by absolute URIs, which indicates a lack of awareness. The
fallback base URI http://www.openmath.org/ is, even independently
from Linked Data considerations, not suitable for CDs from developers who
do not own the openmath.org domain. Finally, authors who are aware of
CDs and symbols having URIs usually merely consider them globally unique
names without relevance for retrieving information about these resources [27].
URI Format Restrictions: Thirdly, while RDF publishers can freely choose
URIs [9], the URI formats of non-RDF languages often impose restrictions
that complicate Linked Data publishing and have to be worked around.
OpenMath’s base/cd#symbol URI schema complies well with Linked Data
practices – unless CDs grow large. As resolving fragments is up to the client,
consequently using hash URIs forces clients to always download a complete
CD, in which they could then locate the desired symbol. Publishers of large
CDs could work around that by serving, upon an initial request, an RDF
graph that merely redirects, via rdfs:seeAlso links, hash to slash URIs, from
which the client would then be able to retrieve the desired fine-grained
information.5 A final problem with old languages such as the OpenMath CD
language, is that certain entities – e.g. properties of symbols – cannot be
given IDs. An XML→RDF translator might generate some, but an agent
interested in retrieving XML representations would also need them. As a
use case for such fine-grained links, consider the (currently Web 1.0) Digital
Library of Mathematical Functions (DLMF [30]). It contains a large number
of equations describing or defining mathematical functions, which could be
linked to the corresponding properties in OpenMath CDs.
5</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>We have made the case for mathematical knowledge on the Web of Data by
outlining potential applications, and mentioned current challenges to publishing
mathematical Linked Data. Doing so requires combining XML and RDF,
particularly due to the inherent structural complexity of mathematical formulae and the
key role of XML exchange languages in mathematics. Publishing mathematical
knowledge is feasible within the current Linked Data architecture, but it would
benefit from better application-layer support for dealing with combinations of
XML and RDF, and with impractically restricted legacy URI formats.</p>
      <p>We conclude with an agenda towards publishing relevant mathematical
datasets. The probably most foundational dataset needed to get mathematics on
the Web of Data started is the official OpenMath CDs. The initial publication,
planned for spring 2011, is, however, only the first step in making the knowledge
from the CDs accessible; the second step is linking mathematics-related existing
datasets to the OpenMath CDs, so that services for these existing datasets can
5 See [26] for a discussion of further problems, also in related languages.
be extended by mathematical functionality – as outlined for statistical datasets
above. DBpedia is a further candidate, as that would offer its large audience a
more formal perspective on mathematics.</p>
      <p>Mathematical knowledge collections that are already available on the Web, but
not currently in a semantic representation, should also be triplified, at least with
shallow mathematical metadata and links to relevant OpenMath CDs. The DLMF
could, e.g., benefit from access to OpenMath computation services, whereas the
benefit for PlanetMath would be similar as for DBpedia. We should also take a
serious view on the April fool’s joke “Linked Open Numbers”, a huge dataset
describing billions of natural numbers [44]. It provides descriptions as trivial
as the name of each number in natural language, and its successor. But how
about a dataset of non-trivial properties of numbers? Accessing, e.g., prime factor
decompositions of large numbers – an information relevant for cryptography – in
a linked dataset, could be much faster than computing it once more, provided a
supercomputer has already done the computation once and published the results.
From an RDF reasoning and querying point of view, such a dataset could serve as
an oracle, providing information whose original computation would by far exceed,
e.g., description logic reasoning. Another such source of non-trivial knowledge
about numbers is the Web 1.0 Online Encyclopedia of Integer Sequences [38].</p>
      <p>Related to information retrieval and computation is the development of
suitable query languages and reasoners. N3 reasoners already support a basic set
of mathematical functions; SPARQL 1.1 supports basic arithmetics. Additionally,
many query processors allow for defining extension functions; a path for supplying
arbitrary functions to query processors via OpenMath should be investigated.
Such an extension could even be specified formally as an entailment regime [20]6.</p>
      <p>The arχiv offers a path towards publishing mathematical structures of
scientific publications. A long-term effort to automatically recover their mathematical
structure from the LATEX sources is in progress [19], the translation of 300,000
publications to somewhat more semantically structured XHTML+MathML
being a first success [40]. The publications have stable URIs, and their metadata
are available as XML, which makes a Linked Data publication feasible right
now. Next, much harder steps would be interlinking with existing publication
datasets, e.g. DBLP, and identifying mathematical symbols that could be linked
to the OpenMath CDs. With an identification of inference structures, this would
ultimately enable machine support for the next collaborative Web-based review
of a P 6= NP proof.
[1] RDFa Core 1.1. Syntax and processing rules for embedding RDF through
attributes. Working Draft. url: http://www.w3.org/TR/rdfa-core/.
[2] acm.rkbexplorer.com. url: http://acm.rkbexplorer.com.
[3] W. Akhtar et al. “XSPARQL: Traveling between the XML and RDF worlds
– and avoiding the XSLT pilgrimage”. In: ESWC. Springer, 2008.
6 originally suggested by Denny Vrandečić
[4] Mathematical Markup Language (MathML) 3.0. Recommendation. url:
http://www.w3.org/TR/MathML/.
[5] S. Bechhofer et al. “Why Linked Data is Not Enough for Scientists”. In:
6th IEEE e-Science conference. 2010.
[6] RDF/XML Syntax Specification (Revised). Recommendation. url: http:
//www.w3.org/TR/rdf-syntax-grammar/.
[7] XML Path Language (XPath) 2.0. Recommendation. url: http://www.</p>
      <p>w3.org/TR/xpath20/.
[8] T. Berners-Lee. cwm. a general purpose data processor for the semantic
web. url: http://www.w3.org/2000/10/swap/doc/cwm.html.
[9] Best Practice Recipes for Publishing RDF Vocabularies. Working Group</p>
      <p>Note. url: http://www.w3.org/TR/swbp-vocab-pub/.
[10] S. Buswell. RDF/XML Test cases for RDF Logic, Web Ontology and Maths
content. Deliverable 5.3b. SWAD-Europe, 2001. url: http://www.w3.
org/2001/sw/Europe/reports/xml_test_cases/wp53.html.
[11] S. Buswell et al. The Open Math Standard 2.0. Tech. rep. The OpenMath</p>
      <p>Society, 2004. url: http://www.openmath.org/standard/om20.
[12] Connexions. url: http://cnx.org.
[13] O. Corby et al. Querying the Semantic Web of Data using SPARQL,
RDF and XML. Tech. rep. 6847. INRIA Sophia Antipolis, Feb. 2009. url:
http://hal.inria.fr/docs/00/36/23/81/PDF/RR-6847.pdf.
[14] H. Cuypers et al. “MathDox, a system for interactive Mathematics”. In:
ED</p>
      <p>MEDIA. AACE, June 2008. url: http://go.editlib.org/p/29092.
[15] C. David et al. “Publishing Math Lecture Notes as Linked Data”. In: ESWC
(Part II). Springer, 2010.
[16] DBpedia. url: http://dbpedia.org.
[17] Deolalikar P vs NP paper. url: http : / / michaelnielsen . org /
polymath1/index.php?title=Deolalikar_P_vs_NP_paper&amp;
oldid=3654.
[18] N. Drummond et al. “Putting OWL in Order: Patterns for sequences in</p>
      <p>OWL”. In: OWL: Experiences and Directions (OWLED). Nov. 2006.
[19] D. Ginev et al. “An Architecture for Linguistic and Semantic Analysis on
the arXMLiv Corpus”. In: AST Workshop at Informatik. 2009. url: http:
//www.kwarc.info/projects/lamapun/pubs/AST09_LaMaPUn+
appendix.pdf.
[20] SPARQL 1.1 Entailment Regimes. Working Draft. url: http://www.w3.</p>
      <p>org/TR/sparql11-entailment/.
[21] T. Gowers and M. Nielsen. “Massively collaborative mathematics”. In:</p>
      <p>Nature 461.15 (2009).
[22] SPARQL 1.1 Query Language. Working Draft. url: http://www.w3.</p>
      <p>org/TR/sparql11-query/.
[23] HELM. Hypertextual Electronic Library of Mathematics. url: http://
helm.cs.unibo.it.
[24] Math-Net. an International Information and Communication System. url:
http://www.math-net.org.
[25] M. Kohlhase. OMDoc – An open markup format for mathematical
documents [Version 1.2]. LNAI 4180. Springer, Aug. 2006. url: http://
omdoc.org/pubs/omdoc1.2.pdf.
[26] C. Lange. “Ontologies and Languages for Representing Mathematical
Knowledge on the Semantic Web”. submitted to Semantic Web Journal.
2011. url: http://www.semantic-web-journal.net/content/
new-submission-ontologies-and-languages-representingmathematical-knowledge-semantic-web.
[27] C. Lange. “Towards OpenMath Content Dictionaries as Linked Data”. In:</p>
      <p>OpenMath Workshop. July 2010. arXiv:1006.4057v1 [cs.DL].
[28] Mizar Mathematical Library. url: http://www.mizar.org/library.
[29] MONET – Mathematics on the net. url: http://monet.nag.co.uk.
[30] Digital Library of Mathematical Functions. url: http://dlmf.nist.</p>
      <p>gov.
[31] OMDoc. url: http://omdoc.org.
[32] T. Omitola et al. “Put in Your Postcode, Out Comes the Data: A Case</p>
      <p>Study”. In: ESWC (Part I). Springer, 2010.
[33] OpenLink Software. OpenLink Universal Integration Middleware – Virtuoso</p>
      <p>Product Family. url: http://virtuoso.openlinksw.com.
[34] OpenMath CDs. url: http://www.openmath.org/cd/.
[35] PlanetMath.org – Math for the people, by the people. url: http : / /
planetmath.org (visited on 2010-07-14).
[36] A. Robbins. Semantic MathML. url: http : / / straymindcough .</p>
      <p>blogspot.com/2009/06/semantic-mathml.html.
[37] F.-P. Servant. “Linking Enterprise Data”. In: LDOW. CEUR-WS 369. Apr.</p>
      <p>2008. url: http://CEUR-WS.org/Vol-369/.
[38] N. J. A. Sloane. “The On-Line Encyclopedia of Integer Sequences”. In:</p>
      <p>Notices of the AMS 50.8 (2003).
[39] RDFa API. An API for extracting structured data from Web documents.</p>
      <p>Working Draft. url: http://www.w3.org/TR/rdfa-api/.
[40] H. Stamerjohanns et al. “Transforming large collections of scientific
publications to XML”. In: Mathematics in Computer Science 3.3 (2010): Special
Issue on Authoring, Digitalization and Management of Mathematical
Knowledge. url: http://kwarc.info/kohlhase/papers/mcs09.pdf.
[41] SysMO-DB SEEK. url: http://www.sysmo-db.org/seek/.
[42] The 1998 ACM Computing Classification System. 1998. url: http://
www.acm.org/about/class/ccs98.
[43] arXiv.org e-Print archive. url: http://www.arxiv.org.
[44] D. Vrandečić et al. “Leveraging Non-Lexical Knowledge for the Linked
Open Data Web”. In: RAFT. 2010. url: http://km.aifb.kit.edu/
projects/numbers/linked_open_numbers.pdf.
[45] D. Vrandečić et al. “Semantics of Governmental Statistics Data”. In: WebSci.
2010. url: http://journal.webscience.org/400/.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>