<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MDML: The Mathdoc Digital Mathematics Library</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre Bouquet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Bouche</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ. Grenoble Alpes</institution>
          ,
          <addr-line>CNRS, CMD, 38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>8</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Following in the steps of previous projects such as EuDML, Mathdoc is launching its Digital Mathematics Library. Based on the reliable infrastructure built for Numdam, learning from previous projects, and relying on a network of institutions we trust, we aim to push the ball further for accessing mathematical content online. We focus for a start on the aggregation part, aiming to reach a critical mass of mathematical content by harvesting various sources: OJS instances, preprint repositories, and local DMLs. We thus build a database of mathematical documents, linking back to the source's website for accessing content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Goals and perimeter</title>
      <p>
        The goal is to make a large part of the mathematical corpus available from a single place, with the best
possible metadata to facilitate searching, and to interoperate with relevant infrastructures. The system is based on
        <list list-type="bullet">
          <list-item><p>an OAI-PMH harvester to gather metadata;</p></list-item>
          <list-item><p>a periodic task orchestrator;</p></list-item>
          <list-item><p>the new Numdam platform [<xref ref-type="bibr" rid="ref2">2</xref>] to provide the core XML parser, and the searching and browsing interface.</p></list-item>
        </list>
      </p>
      <p>As the project started a few months ago, we intend to first reach a critical mass in indexed content with a
highly usable interface, so the focus is currently on the aggregation part. Later on, we will take an incremental
approach to improve our DML services.</p>
      <p>Compared to mini-DML, which was a proof of concept, we use much more detailed metadata in order
to offer a better user experience; we have also started to harvest from many more sources and to adapt to more
common metadata schemas. Compared to EuDML, we use more or less the same metadata, we do not yet have
an API, and we will not try to revive some of its features. The main benefit of our work is to move ahead, with an
entirely new technology underneath and a worldwide scope. Our main target audience is the working mathematician,
always struggling to find a source for the published references they gather from database searches, citations or
colleagues’ hints. We try to make it easy to search a corpus that is still highly fragmented and heterogeneous.</p>
      <sec id="sec-2-1">
        <title>Choosing data source</title>
        <p>The first step in building a DML is to choose where the data comes from. Many resources are
available on the internet, so we need criteria for selecting among them. We decided
to stand on EuDML’s shoulders, which means we intend to aggregate from local DMLs that ensure
          <list list-type="bullet">
            <list-item><p>quality of the mathematical content;</p></list-item>
            <list-item><p>long-term reliability (well-maintained systems with persistent URLs);</p></list-item>
            <list-item><p>a usable OAI-PMH server delivering quality metadata (JATS/BITS if possible, or at least fine-grained enough to enable decent browsing of collections).</p></list-item>
          </list>
        </p>
        <p>We thus rely on a network of institutions we trust, with a common goal of archiving and broadcasting
mathematical content, with sustainability rather than profit in mind.</p>
        <p>Following on from mini-DML, we decided to also include preprint servers such as arXiv or HAL, because they
provide open access to a huge quantity of useful mathematics. It will be possible to filter search results so as to
exclude preprints when the user is looking for formally published material only<sup>1</sup>.</p>
        <p>In order to maximize the number of sources, we also started to ingest content from isolated journals published
with Open Journal Systems (OJS), when we believe that they are backed by a trusted institution, such as a
learned society or a university library. OJS instances are now shipped with an OAI server, and a JATS plugin
is available, so many quality items can be obtained this way. The challenge here will be to draw up an
inventory of all relevant OJS instances, and to select those that meet our criteria.</p>
        <p>EuDML did a great job of ingesting data sources that, to this day, still do not support the OAI protocol. We
therefore use EuDML’s OAI server to retrieve some of its content, taking advantage of the former project
rather than spending time reproducing what has been done. We avoid it whenever possible, though, because
we want regular updates from our chosen sources, and EuDML is currently stalled. We intend to harvest
currently frozen sources anew when possible.</p>
        <p>Once a source has been selected, the next step is to import its data into our database. The goal here is not to
store a local copy of the full text of articles, but to facilitate searching and to link to the source for accessing the
content. We are thus building a catalog rather than a physical library, which makes the choice of sources all the
more important, as we rely on them to provide the content.</p>
        <p>Our first goal is thus to break the (quite artificial) “EU” barrier in EuDML. In this first round, we leave out
fancier features as we focus on making more content visible.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Importing data</title>
        <p>
          As outlined above, we select sources that support the OAI-PMH protocol [<xref ref-type="bibr" rid="ref7">7</xref>]. This protocol makes it easy to
retrieve the metadata of mathematical items, with an explicit XML schema. EuDML set a flavor of JATS [<xref ref-type="bibr" rid="ref6">6</xref>]
as its internal format and, like other EuDML partners, we adopted it afterwards as our internal format for
document-oriented projects such as Numdam<sup>2</sup>, Centre Mersenne<sup>3</sup>, etc. We like it because it is very exhaustive
and well-structured. We also like it because it is meant for the kind of content we deal with, including native
support for MathML expressions, and the ability to encode up to the full text. We also import articles
from Dublin Core when this is the only format available, because it represents a big part of the resources available
through OAI; only basic metadata are retrieved with this format, however. We import books in BITS format [<xref ref-type="bibr" rid="ref5">5</xref>].
          <fn id="fn1"><label>1</label><p>When arXiv or other preprint repositories have explicit metadata for identifying postprints (“author accepted manuscript” or “version of record”: content identical to the published version), these will be considered acceptable alternatives to the publisher’s version.</p></fn>
        </p>
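        <p>To illustrate why Dublin Core yields only basic metadata, here is a minimal sketch that reads an unqualified oai_dc record with the Python standard library. The sample record, the function name basic_metadata and the choice of returned fields are assumptions made for the example; this is not MDML's actual import code.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH oai_dc metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample record, not taken from a real repository.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>On a sample theorem</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2019-03-01</dc:date>
  <dc:identifier>https://example.org/article/42</dc:identifier>
</oai_dc:dc>"""


def basic_metadata(dc_xml):
    """Return the few fields that unqualified Dublin Core can reliably carry."""
    root = ET.fromstring(dc_xml)

    def texts(tag):
        # Collect the stripped text of every non-empty dc:<tag> element.
        return [el.text.strip() for el in root.findall(f"dc:{tag}", NS) if el.text]

    titles = texts("title")
    return {
        "title": titles[0] if titles else None,
        "authors": texts("creator"),
        "dates": texts("date"),
        # Keep only identifiers usable as links back to the source website.
        "links": [i for i in texts("identifier") if i.startswith("http")],
    }


record = basic_metadata(SAMPLE)
```
]]></preformat>
        <p>Nothing in such a record distinguishes a journal, volume or issue, or carries structured author names or MathML, which is the kind of detail a JATS record can provide.</p>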
        <p>As said above, this is only a first step in importing data; we are prepared to support more formats and other
protocols over time.</p>
        <p>
          Once the metadata has been harvested as, or converted to, JATS or BITS, we clean it up and ingest it into our
platform, which is based on the one described in [<xref ref-type="bibr" rid="ref2">2</xref>] and has now been used for large document sets in
different projects (Numdam, Centre Mersenne). Most of the database structure storing the metadata is based
on the platform’s existing one. This core system will continue to improve with the evolution of Mathdoc’s multiple
projects, including MDML.
        </p>
        <p>Another big advantage of the OAI protocol is the possibility to choose the date interval over which we
import data, which makes importing new data easy and avoids the cost of ingesting the same data
again and again. Moreover, it allows us to set up automated regular updates over all of our sources, bringing new
content to MDML automatically. Although scheduled, this never really happened with the EuDML harvester.</p>
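        <p>The date interval mentioned above corresponds to the optional from and until arguments of the OAI-PMH ListRecords verb. A minimal sketch of such a request, assuming a hypothetical endpoint at example.org and a hypothetical helper named list_records_url:</p>
        <preformat><![CDATA[
```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None,
                     from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request limited to a date interval."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date      # lower bound on record datestamps
    if until_date:
        params["until"] = until_date    # upper bound on record datestamps
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; only records changed since 2019-06-01 are requested.
url = list_records_url("https://example.org/oai", "oai_dc", from_date="2019-06-01")
```
]]></preformat>
        <p>Passing the date of the previous harvest as from_date is what keeps each update limited to new or modified records.</p>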
        <p>We also set up a logging system to keep track of imports, storing the raw data and the source, which makes it
possible to understand why an import failed and how to fix it.</p>
        <p>Once the data is stored, the next step is to present it back to users and make it possible to
browse the digital library, in the same way a user can wander through a physical library and browse printed volumes.
Searching among digital items can be done in different ways: searching by author, keyword, title or equation, or
browsing specific journals or books. To be able to offer a fielded search, we need thorough metadata.
However, the most important thing is to have a lot of items available if we really want to add value.</p>
        <p>The implementation of searching in MDML is based on the Numdam platform, thus benefiting from an already
proven tool. The search engine improves over time because its core is shared by Numdam and all
journals managed by Centre Mersenne.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Technical/implementation details</title>
      <sec id="sec-3-1">
        <title>Backend</title>
        <p>As the MDML project is tightly linked with the Numdam platform, which is written in Python, it seemed natural
to use Python for MDML as well. To harvest and retrieve the XML from OAI sources, we use the Sickle library<sup>4</sup>.
We store several pieces of information about each source to harvest: the OAI server URL, the OAI set, the XML format,
whether it is a one-shot ingestion or not (i.e. whether it needs updates), the type of provider and the last harvest
date. We also store the kind of processing to apply. The tricky part is that almost every data source differs in
some way, regardless of the format used (JATS, Dublin Core, etc.). Even when the difference is minor, we need a
system with common processing in which only the small part specific to each source has to be specified. We therefore
built a sort of multiplexer of XML parsers, selected according to the OAI server. Each source parser inherits from
the common XML parser and overrides only what differs.</p>
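        <p>The structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MDML code: the class names, the parser_name field, the registry and the example of a source delivering titles in capitals are all hypothetical.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarvestSource:
    """What is stored per source, following the fields listed above."""
    oai_url: str
    oai_set: Optional[str]
    xml_format: str                  # e.g. "jats" or "oai_dc"
    one_shot: bool                   # True if the source never needs updating
    provider_type: str
    last_harvest: Optional[str] = None
    parser_name: str = "default"     # hypothetical key selecting the parser


class CommonParser:
    """Shared processing; subclasses override only what differs."""

    def title(self, article):
        node = article.find(".//article-title")
        return "".join(node.itertext()).strip() if node is not None else None

    def parse(self, xml_text):
        article = ET.fromstring(xml_text)
        return {"title": self.title(article)}


class UppercaseTitleParser(CommonParser):
    """Hypothetical source that delivers titles in capitals."""

    def title(self, article):
        raw = super().title(article)
        return raw.capitalize() if raw else raw


# The "multiplexer": a registry mapping a source setting to a parser class.
PARSERS = {"default": CommonParser, "uppercase-titles": UppercaseTitleParser}


def parser_for(source):
    """Select the parser registered for this source, else the common one."""
    return PARSERS.get(source.parser_name, CommonParser)()


source = HarvestSource("https://example.org/oai", None, "jats", False,
                       "journal", parser_name="uppercase-titles")
sample = ("<article><front><article-meta><title-group>"
          "<article-title>ON A SAMPLE THEOREM</article-title>"
          "</title-group></article-meta></front></article>")
parsed = parser_for(source).parse(sample)
```
]]></preformat>
        <p>With this layout, supporting a new source that deviates slightly from the common case means adding one small subclass and one registry entry.</p>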
        <p>We then feed the result to the Numdam-based platform, with an additional layer to store OAI metadata such as
the OAI identifier and the OAI source.</p>
        <p>A recurring task has been set up in the background with Celery<sup>5</sup> to check regularly for new data. As said
earlier, OAI-PMH allows us to specify a date interval for harvesting, so the last harvest date is stored for every
MDML source and updated at each new harvest.</p>
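        <p>The bookkeeping around the last harvest date can be sketched independently of the scheduler. In the example below the source is a plain dictionary and the harvester is a stand-in function, both hypothetical; in MDML the periodic triggering is handled by Celery, which is not shown here.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timezone


def run_harvest(source, harvest, now=None):
    """Harvest one source incrementally and advance its last-harvest date.

    `source` is a dict with a "last_harvest" entry (ISO date string or None);
    `harvest` is a callable taking (source, from_date) and returning a count.
    """
    now = now or datetime.now(timezone.utc)
    count = harvest(source, source.get("last_harvest"))
    # Reached only if harvest() did not raise, so a failed run is retried
    # from the same starting date next time.
    source["last_harvest"] = now.strftime("%Y-%m-%d")
    return count


# Hypothetical stand-in for the real harvester: records the requested start date.
seen = []


def fake_harvest(source, from_date):
    seen.append(from_date)
    return 3


src = {"oai_url": "https://example.org/oai", "last_harvest": None}
run_harvest(src, fake_harvest, now=datetime(2019, 6, 1, tzinfo=timezone.utc))
run_harvest(src, fake_harvest, now=datetime(2019, 6, 8, tzinfo=timezone.utc))
```
]]></preformat>
        <p>The first run has no stored date and therefore harvests everything; the second run starts from the date recorded by the first.</p>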
        <p>Detailed logs are stored whenever any of the harvested items fails to be ingested by the platform, including the
raw XML and source information, which allows us to improve our import tasks quickly and to run the import again.</p>
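        <p>A minimal sketch of such a failure log, assuming a hypothetical FailedImport record and a stand-in ingest function; the real platform's ingestion step and storage are not described in enough detail here to reproduce them.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class FailedImport:
    source: str      # which OAI server the record came from
    oai_id: str      # OAI identifier of the record
    raw_xml: str     # payload exactly as harvested, kept for replay
    error: str       # why ingestion failed


def ingest_all(source, records, ingest):
    """Ingest (oai_id, raw_xml) pairs; return the failures instead of stopping."""
    failures = []
    for oai_id, raw_xml in records:
        try:
            ingest(raw_xml)
        except Exception as exc:  # one bad record must not block the rest
            failures.append(FailedImport(source, oai_id, raw_xml, repr(exc)))
    return failures


# Hypothetical ingest step that rejects empty payloads.
def ingest(raw_xml):
    if not raw_xml.strip():
        raise ValueError("empty record")


failures = ingest_all(
    "https://example.org/oai",
    [("oai:example.org:1", "<article/>"), ("oai:example.org:2", " ")],
    ingest,
)
```
]]></preformat>
        <p>Because the raw payload is kept with each failure, the same records can be replayed through a corrected parser without harvesting them again.</p>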
      </sec>
      <sec id="sec-3-2">
        <title>Frontend</title>
        <p>The website is based on Django, like Numdam and the sites managed by Centre Mersenne. There is of course an
additional layer here as well, to serve items on the MDML website, but the core is common and benefits from the work
done by Mathdoc on Numdam. For instance, the platform natively supports a dual TeX/MathML description
of mathematical content, with MathJax on board in order to present it correctly in most situations. The end
goal here is one-click access to the article on its source website.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and perspective</title>
      <p>
        Based on the experience of Mathdoc and its various projects in the area of mathematical documents and
metadata, and on all previous DML projects, the ambition is for the Mathdoc DML to be a significant step forward,
in terms of content covered, towards the Global DML [<xref ref-type="bibr" rid="ref4">4</xref>] supported by the IMU and the newly founded International
Mathematical Knowledge Trust. We chose an incremental approach, setting up a solid foundation
based on the production-ready platform maintained by Mathdoc. The project will evolve over time, and there
is still a lot of work to be done. The number of items will grow by itself as new content is published at the
sources we harvest, and new sources will be added regularly. The quality of the search engine, of browsing and
of the metadata displayed will also improve over time, alongside the Numdam and Centre Mersenne websites. In the end,
we hope to provide a DML with a great many items and thorough metadata, making it possible to browse
mathematical content seamlessly.
        <fn id="fn4"><label>4</label><p>https://github.com/mloesch/sickle</p></fn>
        <fn id="fn5"><label>5</label><p>http://www.celeryproject.org/</p></fn>
      </p>
      <p>Figure 2: Searching for items in MDML</p>
      <p>Figure 3: One article’s details on MDML</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Thierry</given-names> <surname>Bouche</surname></string-name>.
          <article-title>Introducing the mini-DML project</article-title>.
          In Hans Becker, Kari Stange, and Bernd Wegner, editors,
          <source>New developments in electronic publishing, AMS/SMM Special Session, Houston, May 2004, ECM4 Satellite Conference, Stockholm, June 2004</source>,
          pages <fpage>19</fpage>-<lpage>29</lpage>.
          FIZ Karlsruhe / Zentralblatt MATH, <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Thierry</given-names> <surname>Bouche</surname></string-name>
          and
          <string-name><given-names>Olivier</given-names> <surname>Labbe</surname></string-name>.
          <article-title>The new Numdam platform</article-title>.
          In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors,
          <source>Intelligent Computer Mathematics: Proceedings of the 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017</source>,
          number 10383 in <series>Lecture Notes in Computer Science</series>,
          pages <fpage>70</fpage>-<lpage>82</lpage>.
          Springer, <year>2017</year>.
          Also available at http://doi.org/10.5281/zenodo.581405.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>John</given-names> <surname>Ewing</surname></string-name>.
          <article-title>Twenty Centuries of Mathematics: Digitizing and Disseminating the Past Mathematical Literature</article-title>.
          <source>Notices of the AMS</source>,
          <volume>49</volume>(<issue>7</issue>):<fpage>771</fpage>-<lpage>777</lpage>,
          August <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Patrick D. F.</given-names> <surname>Ion</surname></string-name>
          and
          <string-name><given-names>Stephen M.</given-names> <surname>Watt</surname></string-name>.
          <article-title>The Global Digital Mathematics Library and the International Mathematical Knowledge Trust</article-title>.
          In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors,
          <source>Intelligent Computer Mathematics: 10th International Conference, CICM 2017, Edinburgh, UK, 2017, Proceedings</source>.
          Springer, <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <collab>National Center for Biotechnology Information, U.S. National Library of Medicine</collab>.
          <source>Book interchange tag set: JATS extension, version 2.0</source>,
          February <year>2016</year>.
          Full online documentation at https://jats.nlm.nih.gov/extensions/bits/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <collab>National Center for Biotechnology Information, U.S. National Library of Medicine</collab>.
          <source>Journal archiving and interchange tag library, NISO JATS version 1.2</source>,
          January <year>2019</year>.
          Full online documentation at https://jats.nlm.nih.gov/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <collab>Open Archives Initiative</collab>.
          <source>Protocol for Metadata Harvesting</source>.
          Documentation at http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm.
        </mixed-citation>
      </ref>
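      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Wojtek</given-names> <surname>Sylwestrzak</surname></string-name>,
          <string-name><given-names>José</given-names> <surname>Borbinha</surname></string-name>,
          <string-name><given-names>Thierry</given-names> <surname>Bouche</surname></string-name>,
          <string-name><given-names>Aleksander</given-names> <surname>Nowiński</surname></string-name>, and
          <string-name><given-names>Petr</given-names> <surname>Sojka</surname></string-name>.
          <article-title>EuDML – Towards the European Digital Mathematics Library</article-title>.
          In Petr Sojka, editor,
          <source>Proceedings of DML 2010</source>,
          pages <fpage>11</fpage>-<lpage>24</lpage>,
          Brno, July <year>2010</year>. Masaryk University. http://dml.cz/dmlcz/702569.
        </mixed-citation>
      </ref>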
    </ref-list>
  </back>
</article>