<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Hagenberg, Austria</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Formula Concept Discovery and Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Scharpf</string-name>
          <aff>Dept. of Computer and Information Science, Konstanz, Germany</aff>
          <email>philipp.scharpf@uni-konstanz.de</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>08</lpage>
      <abstract>
        <p>In my dissertation, I will develop a method to discover (define) and recognize (identify) formula concepts in Wikipedia articles and STEM documents using Wikidata as a semantic knowledge-base. Both structural (syntax tree) and semantic (identifier names) formula information will be considered. The approach is expected to improve search engines, recommender systems, plagiarism and novelty detection, and ontology learning.</p>
        <p>Research Motivation
My research is motivated by 1) the need for Information Retrieval systems to match mathematical formulae when assessing semantic content and similarity of STEM documents, and 2) the challenge that a given mathematical formula concept usually appears in several variations or equivalent representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>retrieved from the arXiv repository of electronic preprints (http://arxiv.org/) and Wikipedia. I am striving to
develop a method that will be able to map, e.g., all of the formulae collected in Figure 1 to a single formula concept, in particular one linkable to
the Wikidata entry https://www.wikidata.org/wiki/Q868967. I chose the semantic knowledge-base Wikidata
because it is free, open, and can be read and edited by humans and machines.</p>
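      <p>Research Method
Formula Feature Analysis
The first step in the formula feature analysis is tokenization, i.e., the decomposition into their components
(identifiers, operators, numbers, etc.) and Part-Of-Math-tagging: a formula consists of different terms, which
must be distinguished from each other. The Klein-Gordon equation
<disp-formula><tex-math>\frac{1}{c^2}\frac{\partial^2}{\partial t^2}\psi - \nabla^2\psi + \frac{m^2c^2}{\hbar^2}\psi = 0</tex-math></disp-formula>
from quantum physics, as an instructive example, contains a term with a double time derivative, one with
a double space derivative <inline-formula><tex-math>\nabla^2</tex-math></inline-formula> as well as one with a constant prefactor <inline-formula><tex-math>\frac{m^2c^2}{\hbar^2}</tex-math></inline-formula>. The first term can then be further
decomposed into its characters (tokens), that is, the denominator c for the speed of light, the operator <inline-formula><tex-math>\partial_t</tex-math></inline-formula> with
an exponent (number) 2 and the identifier <inline-formula><tex-math>\psi</tex-math></inline-formula> for the physical (quantum) wave function.</p>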
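      <p>The tokenization and tagging step described above can be illustrated with a minimal Python sketch. It is not the method proposed in this work: the tag names, the regular expressions, the table COMMAND_TAGS, and the LaTeX spelling of the first Klein-Gordon term are all assumptions made for illustration, and a real Part-Of-Math tagger would need a much richer grammar.</p>
      <preformat>
```python
import re

# Hypothetical tag set and patterns, chosen only for this illustration.
TOKEN_PATTERNS = [
    ("command", r"\\[A-Za-z]+"),      # LaTeX macros such as \frac or \psi
    ("number", r"\d+(?:\.\d+)?"),     # integers and decimals
    ("identifier", r"[A-Za-z]"),      # single-letter identifiers such as c, t
    ("operator", r"[=+\-*/^_]"),      # infix operators and script markers
    ("bracket", r"[{}()\[\]]"),       # grouping symbols
]
COMPILED = [(tag, re.compile(pattern)) for tag, pattern in TOKEN_PATTERNS]

# Hypothetical reclassification of a few macros into math-level tags.
COMMAND_TAGS = {
    r"\psi": "identifier",
    r"\hbar": "identifier",
    r"\partial": "operator",
    r"\nabla": "operator",
}


def tokenize(latex):
    """Split a LaTeX formula string into (tag, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos != len(latex):
        if latex[pos].isspace():
            pos += 1
            continue
        for tag, regex in COMPILED:
            match = regex.match(latex, pos)
            if match:
                text = match.group()
                if tag == "command":
                    tag = COMMAND_TAGS.get(text, "command")
                tokens.append((tag, text))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {latex[pos]!r} at position {pos}")
    return tokens


# First term of the Klein-Gordon equation (assumed LaTeX spelling).
first_term = r"\frac{1}{c^2} \frac{\partial^2}{\partial t^2} \psi"
for tag, text in tokenize(first_term):
    print(tag, text)
```
      </preformat>
      <p>In this sketch, c and t are tagged as identifiers, the exponents as numbers, and the macros for the partial derivative and the wave function are reclassified through the lookup table rather than by the regular expressions themselves.</p>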
      <p>When analyzing the semantics of a formula, we are faced with the problem of identifier ambiguity, which
requires disambiguation with the help of the partial clarifications available in the text. A single identifier has a
theoretically unlimited number of possible meanings, e.g., E often refers to either an energy or an
electric field in physics, and generally to an expected value in mathematics. Thus, it is essential to improve the retrieval of
the semantics from the surrounding text.</p>
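      <p>As a rough illustration of disambiguation from the surrounding text, the following Python sketch ranks candidate meanings of E by simple cue-word overlap. This is a stand-in technique, not the approach proposed here: the candidate list, the cue words, and the function name disambiguate are hypothetical.</p>
      <preformat>
```python
# Hypothetical candidate meanings of the identifier E with hand-picked cue words.
CANDIDATES = {
    "energy": {"energy", "joule", "kinetic", "potential", "mass"},
    "electric field": {"electric", "field", "charge", "volt"},
    "expected value": {"expected", "expectation", "random", "probability"},
}


def disambiguate(context, candidates=CANDIDATES):
    """Rank candidate meanings by how many of their cue words occur in the context."""
    cleaned = context.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    scores = {
        meaning: len(cues.intersection(words))
        for meaning, cues in candidates.items()
    }
    # Highest overlap first; ties keep the order of the candidate dictionary.
    return sorted(scores, key=scores.get, reverse=True)


ranking = disambiguate("where E denotes the electric field generated by the charge")
print(ranking[0])
```
      </preformat>
      <p>Plain word overlap ignores word order and synonyms, so it only shows where textual context enters the decision.</p>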
      <p>Research Questions
The aim of Formula Concept Discovery (FCD) is to 1) retrieve a large number of formula examples from Wikipedia
articles and arXiv documents together with a mapping to formula concepts (Wikidata items), and 2) recover a
general definition of a formula concept using feature analysis and abstract mathematical formalization.
The aim of Formula Concept Recognition (FCR) is to identify formulae in arXiv documents or Wikipedia
articles as Wikidata formula concept items. This requires a similarity measure that assigns a formula to
a mathematical concept (equation) whenever the measure exceeds a defined threshold. A first rough approach
could be a matching score = # recognized elements / # total elements. To successfully identify a single element,
for example, the Laplace operator <inline-formula><tex-math>\nabla^2 = \Delta</tex-math></inline-formula>, it must be assigned to the corresponding concept in Wikidata,
at https://www.wikidata.org/wiki/Q203484, i.e., to QID Q203484. The aim is to motivate active users of
Wikidata to gradually build a hierarchical structure of the formula elements, assign elements to all available
formulae (property "has part"), and create new items for formula concepts that directly include their parts.</p>
      <p>Evaluation Plans
I will compare and discuss 1) several possible Formula Concept Discovery methods (e.g., taking the first formula
from a Wikipedia article as the defining formula of the concept, formula clustering, etc.), and 2) several possible
Formula Concept Recognition methods (e.g., simple TeX string search vs. parts identification, recognition by
identifier name, symbol and value, etc.).</p>
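      <p>The rough matching score mentioned under Research Questions (# recognized elements / # total elements) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionary ELEMENT_TO_QID, the hand-split element list, and the threshold of 0.5 are hypothetical. Only the QID Q203484 for the Laplace operator is taken from the text, and a real system would resolve elements against Wikidata instead of a hard-coded table.</p>
      <preformat>
```python
# Hypothetical mapping from formula elements (LaTeX strings) to Wikidata QIDs.
# Only Q203484 (Laplace operator) comes from the text.
ELEMENT_TO_QID = {
    r"\nabla^2": "Q203484",
    r"\Delta": "Q203484",
}


def matching_score(elements, element_to_qid=ELEMENT_TO_QID):
    """Return # recognized elements / # total elements for one formula."""
    if not elements:
        return 0.0
    recognized = [element for element in elements if element in element_to_qid]
    return len(recognized) / len(elements)


def exceeds_threshold(elements, threshold=0.5):
    """True if the score exceeds the (arbitrarily chosen) threshold."""
    return matching_score(elements) > threshold


# Elements of the Klein-Gordon equation, split by hand for illustration.
elements = [
    r"\frac{1}{c^2}",
    r"\partial_t^2",
    r"\psi",
    r"\nabla^2",
    r"\frac{m^2c^2}{\hbar^2}",
]
print(matching_score(elements))
print(exceeds_threshold(elements))
```
      </preformat>
      <p>With only the Laplace operator in the lookup table, one of the five elements is recognized, so the formula stays below the threshold. The score rises as more elements are linked to Wikidata items.</p>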
      <p>Completed Research
In my first publication [SGPS+18], I significantly contributed to the creation of the gold standard MathMLben
for the evaluation of the conversion between different mathematical formats (LaTeX vs. Computer Algebra
Systems). In my second publication [SSD+18], I presented the first math-aware QA system that can answer a
natural language question yielding a mathematical formula using Wikidata. My third, most recent publication [SSG18]
initiates my reasoning on a definition of a formula concept and its possible content representations in LaTeX,
MathML, and Wikidata.</p>
      <p>Remaining Research
In my next publication, I will provide a thorough literature review on formula feature analysis. Together with M.
Schubotz and A. Greiner-Petter, I am planning to develop an annotation tool, AnnotaTeX, for LaTeX documents
that will facilitate the annotation process by recommending identifier names to the user. Figure 1 (right) shows
a proposed user interface.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[SGPS+18]</label>
        <mixed-citation>
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>,
          <string-name><given-names>Andre</given-names> <surname>Greiner-Petter</surname></string-name>,
          <string-name><given-names>Philipp</given-names> <surname>Scharpf</surname></string-name>,
          <string-name><given-names>Norman</given-names> <surname>Meuschke</surname></string-name>,
          <string-name><given-names>Howard</given-names> <surname>Cohl</surname></string-name>, and
          <string-name><given-names>Bela</given-names> <surname>Gipp</surname></string-name>.
          <article-title>Improving the representation and conversion of mathematical formulae by considering their textual context</article-title>.
          <source>In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)</source>,
          Fort Worth, USA, Jun.
          <year>2018</year>.
        </mixed-citation>
      </ref>
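      <ref id="ref2">
        <label>[SSD+18]</label>
        <mixed-citation>
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>,
          <string-name><given-names>Philipp</given-names> <surname>Scharpf</surname></string-name>,
          <string-name><given-names>Kaushal</given-names> <surname>Dudhat</surname></string-name>,
          <string-name><given-names>Yash</given-names> <surname>Nagar</surname></string-name>,
          <string-name><given-names>Felix</given-names> <surname>Hamborg</surname></string-name>, and
          <string-name><given-names>Bela</given-names> <surname>Gipp</surname></string-name>.
          <article-title>Introducing MathQA - a math-aware question answering system</article-title>.
          <source>In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL)</source>,
          Workshop on Knowledge Discovery, Fort Worth, USA, Jun.
          <year>2018</year>.
        </mixed-citation>
      </ref>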
      <ref id="ref3">
        <label>[SSG18]</label>
        <mixed-citation>
          <string-name><given-names>Philipp</given-names> <surname>Scharpf</surname></string-name>,
          <string-name><given-names>Moritz</given-names> <surname>Schubotz</surname></string-name>, and
          <string-name><given-names>Bela</given-names> <surname>Gipp</surname></string-name>.
          <article-title>Representing mathematical formulae in Content MathML using Wikidata</article-title>.
          <source>In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>