<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>System Description: sTeX - A LaTeX-based Ecosystem for Semantic/Active Mathematical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Kohlhase</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis Müller</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Frederik Schaefer</string-name>
        </contrib>
        <aff>Computer Science, FAU Erlangen-Nürnberg</aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In the sLaTeX project [sLX], we explore how established communication and
publication workflows (which mainly means LaTeX in mathematics and the theoretical
sciences) can be extended semantically for computer support. The central element
of this endeavour is the sTeX package [Koh08; sTeX], which allows authors to
semantically preload LaTeX documents via special (semantic) macros. sTeX documents
can be processed by pdflatex in the usual way, or via LaTeXML [LTX], a
LaTeX-to-XML transformer, which has an sTeX plugin. The semantic annotations are
exported into the generated PDF or OMDoc [Koh06], respectively, where they
can be used for added-value services. The sTeX packages (and classes) have been
used to produce extensive course materials (3000+ pages of slides and integrated
narrative), ca. 2500 exercise/exam problems, and the SMGloM, a multilingual
mathematical glossary [Koh14], currently containing ≥ 2250 concepts in English
(93%), German (71%) and Chinese (11%). This sTeX corpus, together with the
OMDoc/Mmt format, has informed the development of the sTeX packages and
document model.</p>
      <p>While the original sTeX architecture and realization showed that
semantic preloading of mathematical documents and the deployment of active
documents based on it are possible given enough motivation, scalability and
the management of shared content (one of the potential side benefits of
semantic preloading) quickly became a problem. As a consequence, we now host
and manage all sTeX content as mathematical archives [Hor+11] on
https://MathHub.info and have extended sTeX with special path functionality for
cross-references. As a side effect, MathHub can host the interactive HTML generated
from the OMDoc in a central location.<sup>1</sup></p>
      <p>In spite of this, the use of sTeX never gained much traction outside the
authors’ research group and collaborative projects. In this system description we
detail the effort over the last two years to make sTeX much more usable.</p>
      <p><sup>1</sup> E.g. SMGloM under https://beta.mathhub.info/library/group/c21nbG9t</p>
      <p>The sTeX EcoSystem</p>
    </sec>
    <sec id="sec-2">
      <title>Simplification of sTeX Workflows</title>
      <p>Working with sTeX so far required using several external tools and modifying
LaTeX parameters, mostly related to sTeX’s module system.</p>
      <p>Local paths. To find the actual source files containing modularly imported sTeX
content, the previous workflow necessitated the creation of a localpaths.tex
for every top-level .tex file, which stored the local file path to the sTeX
repositories. This workflow has recently been simplified significantly by replacing the
localpaths.tex files with a single MATHHUB system variable, which points to the
local MathHub clone. LaTeX can now access this variable without needing the --shell-escape
flag.</p>
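      <p>As an illustration of the lookup that a single MATHHUB variable enables, the following Python sketch resolves a module file relative to the local MathHub clone. The function name resolve_module_file, the "source" subdirectory and the example archive name are assumptions made for this sketch; the text above only states that the variable points to the local clone.</p>
      <preformat><![CDATA[
```python
import os
from pathlib import Path


def resolve_module_file(archive, relative_path, env=None):
    """Return the path of a module file inside the local MathHub clone.

    The layout <MATHHUB>/<archive>/source/<file> is an assumption of this
    sketch, not something the text specifies.
    """
    env = os.environ if env is None else env
    root = env.get("MATHHUB")
    if not root:
        raise RuntimeError("the MATHHUB variable is not set")
    return Path(root) / archive / "source" / relative_path


# Example with an explicit environment, so nothing depends on the machine.
example = resolve_module_file(
    "smglom/sets", "set.tex", env={"MATHHUB": "/home/user/MathHub"}
)
```
]]></preformat>
      <p>Passing the environment explicitly keeps the sketch testable; without the env argument the function falls back to the process environment.</p>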
      <p>SMS mode. sTeX allows the introduction of new semantic macros within module
environments, using the \symdef command. Semantic macros defined in some
external module are made locally available with the \importmodule command.
Since modules can (and in practice often do) contain arbitrary additional
content, for \importmodule to work, the semantic content of a module needs to be
extracted from a module environment. Previously, this was achieved by a Perl
script that heuristically parsed a file foo.tex for \symdef and similar commands,
and extracted those into a separate foo.sms file. \importmodule and related
commands then used the .sms file to selectively load only the semantic content. This
both required an external tool and posed a change management problem: every
change to an sTeX module required rerunning the Perl script to ensure that the .sms
file is consistent with the source .tex file.</p>
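      <p>The kind of selective extraction described above can be sketched in a few lines of Python. This is not the Perl script itself: the line-based heuristic, the command list and the sample module (including its module name and macro definition) are assumptions chosen for illustration.</p>
      <preformat><![CDATA[
```python
import re

# Commands treated as semantic content. Only \symdef and \importmodule are
# named in the text; the exact set is an assumption of this sketch.
SEMANTIC_COMMANDS = ("symdef", "importmodule")

_PATTERN = re.compile(r"^\s*\\(" + "|".join(SEMANTIC_COMMANDS) + r")\b")


def extract_semantic_lines(tex_source):
    """Keep only lines that start with one of the semantic commands."""
    return [
        line.strip()
        for line in tex_source.splitlines()
        if _PATTERN.match(line)
    ]


# Illustrative module source; names and syntax details are hypothetical.
sample = r"""
\begin{module}[id=orderings]
\importmodule{relations}
\symdef{linord}{\mathrm{LinOrd}}
A linear ordering is a total, antisymmetric and transitive relation.
\end{module}
"""

semantic_part = extract_semantic_lines(sample)
```
]]></preformat>
      <p>Here semantic_part keeps the \importmodule and \symdef lines and drops the narrative sentence and the environment delimiters, which is the split between semantic and narrative content that the text relies on.</p>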
      <p>The usage of .sms files has now been deprecated. Instead, \importmodule and
related commands enter an “SMS mode” before inputting the required .tex
file directly, in which everything other than selected sTeX commands (such as
\symdef) is ignored, obviating the need for external tools or change
management considerations. The overhead of repeatedly reading the narrative content of
included files, which is redundant in SMS mode, turned out to be negligible.</p>
      <p>File stack size. The availability of a module system can quickly lead to deeply
nested dependency trees on modules (and hence .tex files), especially when using
SMGloM modules. By default, LaTeX has a (relatively low) upper limit for its
file stack, which determines the number of individual files that the TeX engine is
allowed to have open at once. Consequently, sTeX users were advised to manually
increase the file stack size of their local LaTeX distribution, a step that most
LaTeX users are unfamiliar with and that may require admin rights.</p>
      <p>This has recently been resolved by fully utilizing the difference between
reading a module (in SMS mode) and activating a module. During SMS mode (when
the containing .tex file is open), all semantic macros are merely stored in a
separate helper macro. Only after the file has been fully read (and closed by LaTeX)
is its content activated by executing all semantic macros therein. This avoids
recursive \importmodule calls and ensures that LaTeX’s file stack only ever increases
by 1 for semantic imports, obviating the need to manually increase the file
stack size.</p>
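      <p>The effect of separating reading from activation can be made concrete with a small Python simulation. It compares how many imported files are open at once when imports are followed while the importing file is still open, and when they are only recorded and followed after the file is closed. The module graph, the function names and the counting model are assumptions of this sketch (it presumes an acyclic import graph and counts only imported files, not the top-level document); it does not reproduce the TeX implementation.</p>
      <preformat><![CDATA[
```python
# Hypothetical module graph: each file lists the modules it imports.
IMPORTS = {
    "a": ["b"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}


def max_open_nested(module, imports, depth=1):
    """Files open at once when each import is followed while reading."""
    deepest = depth
    for dep in imports[module]:
        deepest = max(deepest, max_open_nested(dep, imports, depth + 1))
    return deepest


def max_open_deferred(root, imports):
    """Files open at once when imports are only recorded while reading
    and followed after the file has been closed."""
    pending = [root]
    seen = set()
    open_files = 0
    peak = 0
    while pending:
        module = pending.pop()
        if module in seen:
            continue
        seen.add(module)
        open_files += 1                   # open the file (reading phase)
        peak = max(peak, open_files)
        recorded = list(imports[module])  # store imports, do not follow them
        open_files -= 1                   # the file is closed again
        pending.extend(recorded)          # activation: follow the imports now
    return peak


nested_peak = max_open_nested("a", IMPORTS)
deferred_peak = max_open_deferred("a", IMPORTS)
```
]]></preformat>
      <p>For the chain of four modules, the nested strategy needs as many open files as the chain is long, while the deferred strategy never has more than one imported file open, independent of the depth of the dependency tree.</p>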
      <p>Standalone SMGloM files. Previously, sTeX distinguished documents (with
LaTeX preamble and document environment) from modules (without) and used
external build tools to provide modules with preambles on the fly. With the need
for external tools otherwise removed (see above), we realized that the LaTeX
standalone package can be used to make modules independently compilable by pdflatex:
standalone.sty allows using \input on .tex files that themselves have a
header, without LaTeX throwing an error. This had been possible earlier, but
now it is documented best practice.</p>
    </sec>
    <sec id="sec-3">
      <title>sTeXLS: An sTeX Language Server and IDE</title>
      <p>The highly fragmented structure<sup>2</sup> of sTeX corpora can be a challenge when
creating and editing sTeX content. Some of the difficulties can be alleviated with
an IDE for sTeX. To avoid being tied to a specific editor, the sTeX IDE is based
on sTeXLS, a language server that can be used with any editor supporting
the language server protocol. sTeXLS has its roots in a bachelor’s thesis [Pli18],
which explored machine-learning-based approaches to finding missing annotations
of term references in sTeX documents. The student behind [Pli18] has continued
working on sTeXLS, which now supports various features that help with authoring
sTeX content. Aside from enabling simple interactions like cross-file definition
look-up, a key feature of sTeXLS is its ability to point out semantic problems in
the source files (semantic linting). These range from minor issues like
redundant imports to actual errors like references to non-existent concepts. sTeXLS
addresses such errors by e.g. listing modules from which the concepts could be
imported or by suggesting similar-sounding concepts in case of a spelling
mistake. These features are so useful that sTeXLS is now commonly used for the
creation of sTeX content.</p>
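      <p>One way to produce suggestions for a misspelled concept name is plain string similarity, sketched below with Python's standard difflib module. This is a stand-in technique chosen for illustration and not a description of how sTeXLS computes its suggestions; the concept list and the function name suggest_concepts are hypothetical.</p>
      <preformat><![CDATA[
```python
import difflib

# Hypothetical list of concept names known from available modules.
KNOWN_CONCEPTS = ["linear ordering", "partial ordering", "lattice", "set"]


def suggest_concepts(name, known=KNOWN_CONCEPTS, limit=3):
    """Return known concept names that are textually close to `name`,
    best match first."""
    return difflib.get_close_matches(name, known, n=limit, cutoff=0.6)


# A reference with a typo ("odering" instead of "ordering").
suggestions = suggest_concepts("linear odering")
```
]]></preformat>
      <p>A linter could attach such a list to the diagnostic for an unknown concept, so that the editor can offer the candidates as quick fixes.</p>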
    </sec>
    <sec id="sec-4">
      <title>Generating Supplemental Material from sTeX Sources</title>
      <p>The semantic annotations allow deriving a number of supplemental resources for
an sTeX document. Concretely, we have tools to automatically generate
dictionaries, glossaries and dependency graphs, e.g. to supplement lecture notes.
These tools act directly on the sTeX sources, utilizing the uniform structure
of semantic annotations.</p>
      <p><sup>2</sup> As TeX cannot load document fragments natively, it is natural to prefer very small
source files that only contain small semantically self-contained fragments.</p>
      <p>Dictionary generation exploits that definienda in a (natural language)
definition are explicitly annotated by a semantic concept. E.g. the synonyms “linear
ordering” and “simple ordering”, as well as the German translation, “lineare
Ordnung”, are identified as the same concept. An English-to-German dictionary
would then have two entries, one for “linear ordering” and one for “simple
ordering”. To generate the dictionary for a lecture, we only include concepts that
have been referenced in the lecture notes/slides.</p>
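      <p>The dictionary construction described above can be sketched in Python as follows. The record format (concept identifier, language, verbalization), the concept identifiers and the function name build_dictionary are assumptions of this sketch; the actual tools work on the sTeX sources directly.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical annotation records: (concept id, language, verbalization).
ANNOTATIONS = [
    ("linear-ordering", "en", "linear ordering"),
    ("linear-ordering", "en", "simple ordering"),
    ("linear-ordering", "de", "lineare Ordnung"),
    ("lattice", "en", "lattice"),
    ("lattice", "de", "Verband"),
]


def build_dictionary(annotations, source, target, referenced):
    """Map each source-language term to the target-language terms
    of the same concept, restricted to referenced concepts."""
    by_concept = defaultdict(lambda: defaultdict(list))
    for concept, language, term in annotations:
        by_concept[concept][language].append(term)

    dictionary = {}
    for concept, terms in by_concept.items():
        if concept not in referenced:
            continue
        for term in terms[source]:
            dictionary[term] = list(terms[target])
    return dictionary


# Only "linear-ordering" is referenced in the (hypothetical) lecture notes.
en_de = build_dictionary(ANNOTATIONS, "en", "de", referenced={"linear-ordering"})
```
]]></preformat>
      <p>Both English synonyms end up as separate entries pointing to “lineare Ordnung”, and the unreferenced concept is left out, matching the behaviour described in the text.</p>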
      <p>If we map words to their definition rather than their translations, we have
a generated glossary. To make it “definitionally closed”, we also include entries
for all concepts referenced elsewhere in the glossary.</p>
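      <p>Making a glossary definitionally closed amounts to a transitive closure over the references inside definitions. The Python sketch below shows this under assumed data: the mapping from concepts to definition text and referenced concepts, as well as the function name definitional_closure, are hypothetical.</p>
      <preformat><![CDATA[
```python
# Hypothetical glossary data:
# concept -> (definition text, concepts referenced in the definition).
DEFINITIONS = {
    "linear ordering": (
        "A partial ordering in which any two elements are comparable.",
        ["partial ordering"],
    ),
    "partial ordering": (
        "A reflexive, antisymmetric and transitive relation.",
        ["relation"],
    ),
    "relation": ("A subset of a Cartesian product of sets.", ["set"]),
    "set": ("A collection of distinct objects.", []),
    "lattice": (
        "A partial ordering with binary meets and joins.",
        ["partial ordering"],
    ),
}


def definitional_closure(seed, definitions):
    """Collect the seed concepts plus everything their definitions reference."""
    closed = set()
    todo = list(seed)
    while todo:
        concept = todo.pop()
        if concept in closed or concept not in definitions:
            continue
        closed.add(concept)
        todo.extend(definitions[concept][1])
    return closed


glossary_entries = definitional_closure(["linear ordering"], DEFINITIONS)
```
]]></preformat>
      <p>Starting from the single concept used in a lecture, the closure pulls in every concept needed to read its definition, while unrelated entries such as “lattice” stay out.</p>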
      <p>We already routinely generate dictionaries and glossaries for some of our
lectures, which has been appreciated by the students. We have also experimented
with the generation of concept dependency graphs, but our visualization efforts
have had limited success so far due to the sheer size of the resulting graphs.</p>
    </sec>
    <sec id="sec-5">
      <title>sTeXML2: Partial Preloading and XHTML Harvesting</title>
      <p>So far, to obtain formal content from sTeX documents, these documents needed
to be converted to OMDoc. To subsequently enable KM services, the resulting
OMDoc needed to be additionally converted to the specific OMDoc dialect used
by the Mmt system by splitting it into (formal) content OMDoc and (informal)
narrative OMDoc. This workflow requires:</p>
      <list list-type="order">
        <list-item><p>dedicated sTeX document classes for OMDoc,</p></list-item>
        <list-item><p>an sTeX plugin for LaTeXML that allows generating OMDoc, partially
overriding core methods of LaTeXML,</p></list-item>
        <list-item><p>a suite of LaTeXML bindings for most (if not all) sTeX primitive macros, and</p></list-item>
        <list-item><p>an Mmt component for importing the OMDoc generated by LaTeXML.</p></list-item>
      </list>
      <p>All of these components needed to be kept consistently in sync with respect to
any updates regarding the sTeX package, LaTeXML, and Mmt, and as a result
regularly suffered from bitrot and increasingly bloated and difficult-to-maintain
implementations. Additionally, OMDoc generation was incompatible with the
document classes commonly used by LaTeX users.</p>
      <p>As a result, we have deprecated the direct OMDoc generation via LaTeXML
and the sTeX plugin and are re-basing the sTeX packages on a very selective set
of semantic primitives. Note that we only need to implement LaTeXML bindings
for these and can reuse the majority of the sTeX functionality implemented in TeX,
since LaTeXML covers enough TeX/LaTeX primitives by now. This makes it much
easier to maintain coherence between the LaTeX implementation and the LaTeXML
bindings. We now use LaTeXML to generate XHTML (which LaTeXML
supports natively) with OMDoc annotations (provided by the package bindings
alone). Crucially, this is compatible with all existing LaTeXML bindings, and
yields documents that can be immediately inspected with a web browser
without loss of document content or significant impact on layout, maintaining the
narrative structure of the original document while introducing partial OMDoc
information where induced by semantic macros. Afterwards, the Mmt system
can harvest the generated XHTML to extract the OMDoc fragments relevant
for KM services. This simple change of approach realizes an old desideratum in
the sLaTeX project: flexibly mixing (partial) sTeX functionality into arbitrary
LaTeX document classes. This is called “sTeX light” in [KKL10].</p>
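      <p>The harvesting step can be pictured as a pass over the XHTML that collects only the annotated elements and ignores the rest. The Python sketch below uses the standard xml.etree.ElementTree parser on a toy document. The attribute names data-omdoc-symbol and data-omdoc-role, the symbol identifiers and the function name harvest are invented for this sketch; the text does not specify how the annotations are encoded, and the actual harvester is part of the Mmt system.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Toy XHTML as a converter might emit it. The attribute names
# "data-omdoc-symbol" and "data-omdoc-role" are invented for this sketch.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A <span data-omdoc-symbol="orderings?linord"
               data-omdoc-role="definiendum">linear ordering</span>
       is a total <span data-omdoc-symbol="orderings?po"
               data-omdoc-role="reference">partial ordering</span>.</p>
  </body>
</html>"""


def harvest(xhtml_text):
    """Return (symbol, role, text) for every annotated element,
    in document order."""
    root = ET.fromstring(xhtml_text)
    found = []
    for element in root.iter():
        symbol = element.get("data-omdoc-symbol")
        if symbol is not None:
            text = "".join(element.itertext()).strip()
            found.append((symbol, element.get("data-omdoc-role"), text))
    return found


fragments = harvest(XHTML)
```
]]></preformat>
      <p>Because the annotations ride along on ordinary XHTML elements, the same file remains viewable in a browser, and unannotated narrative content is simply skipped by the harvester.</p>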
    </sec>
    <sec id="sec-6">
      <title>Conclusion &amp; Future Work</title>
      <p>We have significantly improved the user-friendliness of the sLaTeX ecosystem by
minimizing the number of required external tools and simplifying the general
workflow, while supplying additional optional tools for added-value services.</p>
      <p>As future work, we intend to</p>
      <list list-type="order">
        <list-item><p>extend the sTeX language to subsume all primitives of the Mmt/OMDoc
ontology, allowing sTeX to serve as a full surface language for Mmt,</p></list-item>
        <list-item><p>allow for semantic markup of arbitrary (in particular informal natural
language) document fragments, and</p></list-item>
        <list-item><p>improve and extend IDE support, e.g. by providing direct access and search
functionality for the SMGloM.</p></list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Hor+11]
          <string-name>
            <given-names>Fulya</given-names>
            <surname>Horozal</surname>
          </string-name>
          et al.
          <article-title>“Combining Source, Content, Presentation, Narration, and Relational Representation”</article-title>
          . In: Intelligent Computer Mathematics. Ed. by James Davenport et al.
          <source>LNAI 6824</source>
          . Springer Verlag,
          <year>2011</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>227</lpage>
          . isbn: 978-3-642-22672-4. url: https://kwarc.info/frabe/Research/HIJKR_dimensions_11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [KKL10]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          , and Christoph Lange.
          <article-title>“sTeX - A System for Flexible Formalization of Linked Data”</article-title>
          .
          <source>In: Proceedings of the 6th International Conference on Semantic Systems (I-Semantics) and the 5th International Conference on Pragmatic Web. Ed. by Adrian Paschke et al. ACM</source>
          ,
          <year>2010</year>
          . isbn: 978-1-4503-0014-8. doi: 10.1145/1839707.1839712. arXiv: 1006.4474v1 [cs.SE].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Koh06]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          .
          <source>OMDoc - An open markup format for mathematical documents [Version 1.2]. LNAI 4180</source>
          . Springer Verlag, Aug.
          <year>2006</year>
          . url: http://omdoc.org/pubs/omdoc1.2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Koh08]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          .
          <article-title>“Using LaTeX as a Semantic Markup Format”</article-title>
          .
          <source>In: Mathematics in Computer Science 2.2</source>
          (
          <year>2008</year>
          ), pp.
          <fpage>279</fpage>
          -
          <lpage>304</lpage>
          . url: https://kwarc.info/kohlhase/papers/mcs08-stex.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Koh14]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kohlhase</surname>
          </string-name>
          .
          <article-title>“A Data Model and Encoding for a Semantic, Multilingual Terminology of Mathematics”</article-title>
          . In: Intelligent Computer Mathematics 2014. Ed. by Stephan Watt et al.
          <source>LNCS 8543</source>
          . Springer,
          <year>2014</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>183</lpage>
          . isbn: 978-3-319-08433-6. url: https://kwarc.info/kohlhase/papers/cicm14-smglom.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [LTX]
          <string-name>Bruce Miller</string-name>
          .
          <article-title>LaTeXML: A LaTeX to XML Converter</article-title>
          . url: http://dlmf.nist.gov/LaTeXML/ (visited on 03/12/
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Pli18]
          <string-name>
            <given-names>Marian</given-names>
            <surname>Plivelic</surname>
          </string-name>
          .
          <article-title>“Using machine learning to support annotating of keywords in mathematical texts”</article-title>
          .
          <source>B.Sc. Thesis</source>
          . FAU Erlangen-Nürnberg, Feb.
          <year>2018</year>
          . url: https://gl.kwarc.info/supervision/BSc-archive/blob/master/2018/Plivelic_Marian.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [sLX]
          <article-title>sLaTeX: An Ecosystem for Semantically Enhanced LaTeX</article-title>
          . url: https://github.com/sLaTeX (visited on 03/11/
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [sTeX]
          <article-title>sTeX: A semantic Extension of TeX/LaTeX</article-title>
          . url: https://github.com/sLaTeX/sTeX (visited on 05/11/
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>