<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>and workflow gaps
in library efforts to preserve online journal
literature. Since libraries are increasingly
involved in journal publishing</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Model for Integrating the Publication and Preservation of Journal Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kevin S. Hawkins</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the 15</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Michigan</institution>
          ,
          <addr-line>Ann Arbor</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>112</fpage>
      <lpage>116</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Until quite recently, publishers produced documents on
physical media, and libraries acquired and preserved
copies of these documents. But in the era of the
Internet, when publishers host content online, the
library’s role in acquiring and preserving the content is
in jeopardy: without special licensing arrangements
such as those often provided by open-access journals, a
library has no legal right to make a copy of the content
for preservation.</p>
      <p>
        Various business models have evolved to address
this situation, especially for journals, which are
increasingly available only online. For non-open-access
journals, research libraries often negotiate the right to
create a digital copy of any content acquired during the
period of subscription [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and make this content
available only to their patrons [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], though few are
equipped to provide this kind of restricted access and
archiving with integrated browse and search functions.
To address the more pressing concern of publishers
going out of business without any libraries holding a
copy of the content, libraries and publishers have
collaborated in initiatives like LOCKSS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], CLOCKSS
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Portico [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to guarantee that one or
more copies of the content will become available if it is
no longer available from the publisher. Similarly, the
Koninklijke Bibliotheek and Elsevier reached an
agreement in 2002 whereby the KB will preserve
Elsevier journals under terms similar to those governing
journals that use LOCKSS, CLOCKSS, and Portico [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Still, there are problems with these models. LOCKSS
and CLOCKSS use web crawling, which captures
the appearance of webpages but not their underlying
structure or search functionality. Portico and the KB, on
the other hand, rely on publishers to deliver journal
articles in valid file formats, and not just the version
first published but also any corrected versions of these
articles.
      </p>
      <p>
        One way to ensure that a library always has access
to the latest content is for the library to operate the very
system used to publish the journal. A survey in 2010 of
a cross-section of North American academic libraries
found that, of 144 responding institutions, 43 offered
“operational publishing services” to scholars at their
institution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Of these 43 institutions, most host
publications using open-source software such as Open
Journal Systems (OJS) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or DSpace [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], while about a
quarter use Digital Commons [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a hosted platform
provided by bepress. Unfortunately, all of these
platforms deliver to users only those files (primarily
PDF files) created and uploaded by a journal editor.
Since the library is not in a position to control the
software and workflows used to create these files, the
library can only provide bitwise preservation of the
files, severely hampering future migration of the
content.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 A higher standard for preservation</title>
      <p>
        Since libraries are increasingly involved in journal
publishing, HathiTrust [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a shared
preservation-quality digital repository, is a natural place to archive
and provide access to journal literature to ensure its
long-term preservation and discoverability. HathiTrust
already archives and provides access to reformatted
library holdings, but the University of Michigan
Library, a founding member of HathiTrust, sees an
opportunity to use HathiTrust for publishing
born-digital journals as well. To develop an infrastructure in
support of low-cost university-based publishing that
addresses the needs and values of both content creators
and librarians, the U-M Library is funding the creation
of mPach [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], an open-source, end-to-end publishing
system in which the act of publishing and the act of
archiving are unified. In other words, archiving in
HathiTrust happens as a byproduct of publication rather
than being carried out after the fact. mPach leverages
existing components of HathiTrust and available
open-source software where appropriate.
      </p>
      <p>Archiving is not as simple as saving a copy of a file
produced by a journal editor, as OJS and institutional
repositories generally do. Instead, the content needs to
be stored in a format that allows digital preservation.
PDF/A, a non-proprietary variant of the PDF family
standardized as ISO 19005, is often suggested for such
needs, but even a PDF/A file is poorly suited for use
with screen readers for the visually impaired and for
any non-paginated display, and is suboptimal even for
searching and data mining.</p>
      <p>
        Rather than preserving the paginated appearance of
a document, the text of the article needs to be stored in a
format that reflects its structure and semantics, with
associated media in formats that can be preserved and
rendered. mPach has developed a specification for
journal articles that uses the Journal Article Tag Suite
(JATS), an application of NISO Z39.96-2012 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], for
the text and stores this with high-quality versions of
media objects and with a METS record containing
structural and preservation metadata.
      </p>
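      <p>As a rough illustration of the structural and preservation metadata described above, the following Python sketch lists the files of a package in a METS file section and records a checksum for each. It is not mPach's actual METS profile: the file names, the USE value, the ID scheme, and the choice of SHA-256 are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_file_section(files):
    """Return a <mets:fileSec> listing each file with a SHA-256 checksum.

    `files` maps a package-relative path to (mime_type, content_bytes).
    """
    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    # The USE value is a hypothetical label, not taken from any real profile.
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "article"})
    for index, (path, (mime_type, content)) in enumerate(sorted(files.items()), start=1):
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {
            "ID": f"FILE{index:04d}",  # assumed ID scheme
            "MIMETYPE": mime_type,
            "SIZE": str(len(content)),
            "CHECKSUM": hashlib.sha256(content).hexdigest(),
            "CHECKSUMTYPE": "SHA-256",
        })
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": path,
        })
    return file_sec


# Hypothetical package: the JATS text plus one high-quality figure.
package = {
    "article.xml": ("application/xml", b"<article/>"),
    "fig1.tif": ("image/tiff", b"\x49\x49\x2a\x00"),
}
file_sec = build_file_section(package)
print(ET.tostring(file_sec, encoding="unicode"))
```
]]></preformat>
      <p>Recording a size and checksum for every file is what lets a repository later verify that the text and its media objects are unchanged, independently of how they are rendered.</p>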
    </sec>
    <sec id="sec-3">
      <title>3 An overview of mPach</title>
      <p>There are three major parts of mPach (see also figure 1),
each of which includes components in various stages of
development at the time of writing:
        <list list-type="bullet">
          <list-item>
            <p>the peer review and editorial system: what authors and reviewers interact with</p>
          </list-item>
          <list-item>
            <p>Prepper: what prepares the article for ingest into HathiTrust for archiving and publication</p>
          </list-item>
          <list-item>
            <p>modified HathiTrust components: various modifications to existing components of the HathiTrust environment to support born-digital journal articles</p>
          </list-item>
        </list>
      </p>
      <p>As a modular system, mPach could be used with any
peer review and editorial system that is capable of
interacting with Prepper; however, the developers have
chosen to provide OJS as the default option. Despite
having no support for digital preservation, OJS is
already widely used for library-based journal
publishing, and mPach’s integration with this software
will allow for a smooth transition of journals already
published using OJS into the HathiTrust repository.
Integration with mPach requires that manuscripts that
reach the “layout” stage in OJS be sent to Prepper,
which prepares the HathiTrust Submission Information
Package (SIP).</p>
      <p>
        Prepper provides a user interface for the editor of a
journal: a dashboard for administering the journal and
putting manuscripts through a production process—akin
to composition and typesetting—that prepares all
content according to the preservation standard
developed for mPach content in HathiTrust. Prepper
invokes Norm, a Python application developed to
convert manuscripts from Office Open XML
(“DOCX”) format [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] into XML that conforms to
JATS. DOCX is the default option because, like OJS, it
is widely used in the editorial process of journals
published by libraries. The Prepper interface also guides
the staff member through a review of validation errors
detected by Norm’s conversion, uploading
high-resolution figures, supplying “alt text” for figures,
previewing the article as rendered using the default
stylesheet (based on the Preview XSLT stylesheets
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]), uploading supplementary material [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and
submitting for ingest into HathiTrust.
      </p>
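      <p>The kind of conversion Norm performs can be sketched in a few lines of Python. This is not Norm's code: the style-to-element mapping, the function names, and the sample archive (which holds only word/document.xml and is not a complete DOCX file) are all hypothetical. The sketch reads paragraphs and their style IDs from the main document part and emits a JATS body in which each heading opens a new section.</p>
      <preformat><![CDATA[
```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Hypothetical mapping from Word paragraph style IDs to JATS element names.
STYLE_TO_JATS = {"Heading1": "title"}


def read_paragraphs(docx_bytes):
    """Yield (style_id, text) for each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(f"{W}p"):
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val") if style is not None else None
        text = "".join(node.text or "" for node in para.iter(f"{W}t"))
        yield style_id, text


def to_jats_body(paragraphs):
    """Build a JATS <body>; each heading style opens a new <sec>."""
    body = ET.Element("body")
    container = body
    for style_id, text in paragraphs:
        if not text.strip():
            continue  # skip empty paragraphs
        tag = STYLE_TO_JATS.get(style_id, "p")
        if tag == "title":
            container = ET.SubElement(body, "sec")
        ET.SubElement(container, tag).text = text
    return body


def make_sample_docx():
    """Create a stand-in archive holding only word/document.xml."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Workflow</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>An editor submits </w:t></w:r>'
        '<w:r><w:t>the article.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


body = to_jats_body(read_paragraphs(make_sample_docx()))
print(ET.tostring(body, encoding="unicode"))
# <body><sec><title>Workflow</title><p>An editor submits the article.</p></sec></body>
```
]]></preformat>
      <p>Driving the conversion from named styles rather than visual formatting is why a predefined list of styles matters: a paragraph whose style is not in the mapping falls back to a plain paragraph, which is the kind of case a review step would need to surface to the editor.</p>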
      <p>
        mPach requires a number of significant
modifications to HathiTrust components and workflows
originally designed to support reformatted print
materials. The reading interface in HathiTrust, which
previously supported only rendering of digitized page
images, renders JATS XML as HTML, allows a user
to download a dynamically generated PDF or EPUB,
displays metadata specific to articles (figure 2), and links
to a special “collection” for the journal in HathiTrust’s
Collections application [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] that allows for browsing
volumes and issues of the journal (figure 3).
      </p>
      <p>
        Discovery of known items in HathiTrust using
metadata such as title and author is currently provided
by a catalog of MARC records, with one per item in the
repository. For mPach, each article has its own analytic
catalog record, tied to a monographic record for the
journal as a whole. Finally, the HathiTrust Data API
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] allows for the content of each article to be retrieved
for use outside of the native HathiTrust interface.
      </p>
      <p>Note that by policy HathiTrust only closes access to
content for legal reasons, not because a rightsholder
wants to restrict access. Therefore, mPach only supports
the publishing of open-access journals.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Workflow</title>
      <p>In the typical workflow for publishing a journal using
mPach, a journal editor uses OJS to manage
submissions, peer review, and the editing process. Once
an article reaches the “layout” stage (where a
combination of composition and typesetting allows the
article to be formatted in a consistent way), the journal
editor formats it according to a predefined list of styles
in Microsoft Word and submits the article in DOCX to
mPach’s Prepper, which guides the editor through
conversion to JATS XML, preparation of the SIP, and
submission for ingest. Prepper keeps track of articles so
that a revised version can be submitted for ingest.
Currently the ingest process overwrites any previous
version of an item with the same identifier, but
eventually HathiTrust will archive past versions and
allow users to navigate among them.</p>
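      <p>The difference between the current overwrite behavior and the planned retention of past versions can be illustrated with a toy model in Python. The class, its method names, and the identifier are hypothetical and do not reflect the HathiTrust ingest code.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy stand-in for a repository keyed by item identifier."""
    keep_history: bool = False
    items: dict = field(default_factory=dict)

    def ingest(self, identifier, package):
        """Store a package and return how many versions are now held."""
        versions = self.items.setdefault(identifier, [])
        if not self.keep_history:
            versions.clear()  # overwrite: drop any previous version
        versions.append(package)
        return len(versions)

    def latest(self, identifier):
        return self.items[identifier][-1]


# Overwrite on re-ingest, as the text describes for the current process.
current = Repository()
current.ingest("article-001", "first version")
current.ingest("article-001", "corrected version")
print(current.items["article-001"])  # ['corrected version']

# Retaining past versions, as the text anticipates.
planned = Repository(keep_history=True)
planned.ingest("article-001", "first version")
planned.ingest("article-001", "corrected version")
print(planned.items["article-001"])  # ['first version', 'corrected version']
```
]]></preformat>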
    </sec>
    <sec id="sec-5">
      <title>5 mPach as a shared infrastructure</title>
      <p>
        The U-M Library plans to host the Prepper system,
including the submission module, to facilitate
authorized deposit of content, and will make this system
available for use by organizations wishing to publish
journal literature in HathiTrust. The developers
envision extending the Norm component to handle
OpenDocument (“ODT”) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and LaTeX as input
formats, each of which is more commonly used in
certain communities. Furthermore, if the Book
Interchange Tag Suite [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is adopted as a standard, the
mPach architecture might be extended to support
monograph publishing. While mPach is currently being
developed to meet the needs of the U-M Library, the
contribution of the source code to the planned
HathiTrust Development Environment should foster
contributions from developers not at U-M and therefore
lead to the creation of a truly shared infrastructure for
publishing open-access scholarly journals.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sadie L.</given-names>
            <surname>Honey</surname>
          </string-name>
          .
          <article-title>Preservation of electronic scholarly publishing: an analysis of three approaches</article-title>
          .
          <source>Portal: Libraries and the Academy</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>75</lpage>
          , Jan.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>NISO SERU Standing Committee</collab>
          .
          <article-title>SERU: A Shared Electronic Resource Understanding: A Recommended Practice of the National Information Standards Organization</article-title>
          . National Information Standards Organization (NISO), May
          <year>2012</year>
          . http://www.niso.org/publications/rp/RP-7-2012_SERU.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Lots of Copies Keep Stuff Safe. http://www.lockss.org/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] CLOCKSS. http://www.clockss.org/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Portico. http://www.portico.org/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>National Library of the Netherlands and Elsevier Science make digital preservation history: permanent digital archive assures perpetual accessibility of scientific heritage</article-title>
          . August 20,
          <year>2002</year>
          . http://www.kb.nl/en/news/news-archive2002/national-library-of-the-netherlands-andelsevier-science-make-digital-preservation-history.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] James L. Mullins, Catherine Murray-Rust, Joyce L. Ogburn, Raym Crow, October Ivens, Allyson Mower, Daureen Nesdill, Mark Newton, Julie Speer, and Charles Watkinson.
          <source>Library Publishing Services: Strategies for Success: Final Research Report</source>
          . March
          <year>2012</year>
          . http://wp.sparc.arl.org/lps/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Open Journal Systems. http://pkp.sfu.ca/ojs/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] DSpace. http://www.dspace.org/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Digital Commons. http://digitalcommons.bepress.com/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] HathiTrust Digital Library. http://www.hathitrust.org/.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] mPach. http://www.lib.umich.edu/mpach.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Journal Article Tag Suite. http://jats.nlm.nih.gov/.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Office Open XML. Wikipedia. http://en.wikipedia.org/wiki/Office_Open_XML.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] NISO Journal Article Tag Set (JATS) version 1.0: Preview XSLT stylesheets. https://github.com/NCBITools/JATSPreviewStylesheets.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <article-title>Recommended Practices for Online Supplemental Journal Article Materials: a recommended practice of the National Information Standards Organization and the National Federation of Advanced Information Services</article-title>
          . January
          <year>2013</year>
          . http://www.niso.org/publications/rp/rp-15-2013.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Collections. HathiTrust Digital Library. http://babel.hathitrust.org/cgi/mb.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>HathiTrust Data API</article-title>
          . http://www.hathitrust.org/data_api.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] OpenDocument. Wikipedia. http://en.wikipedia.org/wiki/OpenDocument.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Book Interchange Tag Suite (BITS) 0.2 DRAFT. http://jats.nlm.nih.gov/extensions/bits/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>