<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computational Environment: An ODP to Support Finding and Recreating Computational Analyses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michelle Cheatham</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Vardeman II</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazifa Karima</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Hitzler</string-name>
          <email>pascal.hizlerg@wright.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Notre Dame</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wright State University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Computational Environment ontology design pattern models the environment in which a computational analysis was conducted down to the hardware level. The pattern is intended to support comparison and reproducibility of computational analyses. Due to the concrete nature of the pattern, it makes use of several controlled vocabularies, which are kept current through an automated script that leverages the manually curated information on Wikipedia.</p>
      </abstract>
      <kwd-group>
        <kwd>computational environment</kwd>
        <kwd>ontology design pattern</kwd>
        <kwd>reproducibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recently there has been a push towards ensuring the reproducibility of scienti c
results presented in scholarly publications. Journals have relaxed page
restrictions on \methodology" sections, and funding agencies have begun to require
investigators to publish any data they collect to data repositories. This is key
to allowing scientists to build on the work of others and to avoid wasted funds
and e ort unnecessarily repeating work or re-collecting data. These e orts alone
are not likely to be su cient, however. To truly enable discoverability and
reproducibility of previous scienti c results as required by the scienti c method,
as well as to understand the context of those results so that the results can
be correctly interpreted and extended, there is a need to preserve the
underlying computations and analytical process that led to those results in a generic
machine-readable format.</p>
      <p>Signi cant work has already been done on modeling a computational analysis
process, but this prior work tends to stop short of capturing the underlying
hardware and operating system details necessary to replicate the procedure.
In this paper we therefore present an ontology design pattern to represent a
Computational Environment. The ODP was developed over the course of several
working sessions by a group of ontological modeling experts, library scientists,
and domain scientists from di erent elds, including computational chemists and
high-energy physicists interested in preserving analysis of data collected from the
Large Hadron Collider at CERN. Our goal was to arrive at an ontology design
pattern that is capable of answering the following competency questions:
{ What environment do I need to put in place in order to replicate the work
in Paper X?
{ There has been an error found in Script Y. Which analyses need to be re-run?
{ Based on recent research in Field Z, what tools and resources should new
students work to become familiar with?
{ Are the results from Study A and Study B comparable from a computational
environment perspective?
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Because the goal of reproducing computational analyses is recognized as very
important, there has been a large body of work on this topic. In this section
we focus on the existing work that is most similar to our goals. The primary
di erence between these previous e orts and our current one is the type of
software and hardware environment in which a computational analysis takes place.
For instance, Oscar Corcho and his colleagues have been working on Research
Objects [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which are somewhat based on Carol Goble's Taverna work ows [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This work assumes that the activities within the work ows are web services. As
a result, Research Objects do not model machine- and platform-speci c aspects
of the computational environment, such as processor speed, amount of memory,
operating system version, etc.
      </p>
      <p>
        Other work comes from the eld of digital object preservation. PREMIS, for
instance, is a data dictionary associated with the Open Archival Information
System [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The focus here is slightly di erent from our goals { rather than
attempting to preserve the environment necessary to reproduce or understand
a series of computational activities, PREMIS seeks to preserve the environment
necessary to render a document or program (such as a video game) that has been
archived. The focus on preserving an entity rather than a process, as well as the
need to remain compatible with the rest of PREMIS and the OAIS, mean that
the model is not suitable for our current use case. However, the PREMIS model
does consider hardware, software, and external dependencies, and it can serve
as a useful basis for the Computational Environment model we are developing.
      </p>
      <p>
        Another related e ort of which we are aware is work by Secure Business
Austria, together with a research group at Karlsruhe, to develop a \process context
model" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This work is in a sense a proper superset of our current goals: their
model considers hardware, operating system, software, and third party libraries
and services, but it goes far beyond that by also including the organizational
environment in which the process was executed, the people involved, licenses,
authorizations, etc. This wide-ranging coverage area and the sheer size of the
model (it contains 240 entities arranged in 25 major groups) make this model
unwieldy for our current e ort.
      </p>
      <p>
        Very recent work by researchers from Ghent University and the University of
Bonn is available online as a technical report [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The ontology presented there
models software components and their con guration with the intent of enabling
reproduction of computational experiments and analysis. Important entities in
their model include a software bundle, such as a library or application, a module,
which is a particular version of a library, and a component, which is a speci c part
of a module that can be called with particular parameters, such as a constructor
in an object-oriented application. This approach to modeling a computational
environment somewhat overlaps the one presented here. It lacks the details about
the hardware and some aspects of the operating system environment that this
e ort seeks to capture, but it does represent the software portions of the
computational environment important for reproducibility. It would be possible to align
the two models or to replace a portion of the model presented here with the one
from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We will discuss this in more detail in Section 5.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Scope</title>
      <p>Our goal in this work is to model the actual environment present during a
computational analysis. Representing all possible environments in which it is
feasible for the analysis to be executed is outside of the scope of our current
e ort. For example, statements such as `This analysis was done on a computer
running Ubuntu version 14.0.4' are in scope, while knowledge such as `For this
analysis, Ubuntu version 12.0 or greater is required' is not.</p>
      <p>Figure 1 shows a general representation of the hardware, software, and
external resources associated with a computational analysis. It is possible to consider
all of this part of the Computational Environment. Instead, we have decided
to de ne the boundaries somewhat more narrowly: we do not include the
runtime con guration and parameters as part of the environment. The rationale is
that to some extent the same environment should be applicable to many
computational analyses in the same eld of study, but this would not be true if we
included such analysis-speci c information as runtime parameters as part of the
environment. Data sources were considered outside of the con nes of the
computational environment for similar reasons. External web services and similar
resources were not included because they are not inter-related in the same way
as the environmental elements are. For instance, deciding to use a di erent
operating system often necessitates using a di erent version of drivers, libraries, and
software applications, whereas in most cases the entire blue section of Figure 1
could be changed with no impact on the external services.</p>
      <p>These decisions are obviously subjective to some degree, but they make
practical sense in this case because numerous knowledge representations for web
services and data resources already exist, and it does not seem productive to
recreate them here. Although highly relevant to scienti c work ows, we have further
decided to leave the modeling of parallel and grid computing environments for
future work.</p>
      <p>With the scope now well de ned, the question of how to represent the
environmental components can be addressed.
There are two general ways to approach this modeling task: abstract-to-concrete
and concrete-to-abstract. In the former, a environmental component is
recursively broken down into its constituent components until the desired granularity
is reached. For instance, introductory computer science textbooks traditionally
state that a computer (i.e. the hardware) consists of processors, memory, I/O
devices, and network interfaces. I/O devices can be categorized as disk drives,
visualization technologies, etc. The second way to approach the modeling task
is to begin from concrete data, such as from tools that collect or utilize
information about a computational environment, and generalize from that to more
abstract concepts. We have chosen the later approach for this e ort, using the
tools VMware Player and Docker as our data sources, and sanity-checking our
results using the abstract-to-concrete viewpoint and the competency questions
presented above.</p>
      <p>VMware Player is a hardware virtualization tool available free for non
commercial use. It enables many aspects of a computational environment, including
the operating system, libraries, environment variables, software, and local les,
to be con gured and saved as a virtual machine that can then be downloaded
and run on any computer with an x86 architecture. Creating a new virtual
machine involves indicating which hardware resources of the host computer system
should be made available to the virtual machine, and then installing and
conguring the guest operating system along with any other desired software. A
screenshot of the dialog for provisioning a new virtual machine and a snippet of
the resulting con guration le are shown in Figure 2.</p>
      <p>Docker is a free open source application that has become very popular among
scientists for reproducing computational analyses. In contrast to the hardware
virtualization provided by VMware Player, Docker provides operating system
virtualization by using abstractions built into operating system kernels such as
the linux container system (LXC). Starting from a base OS layer, Docker allows
the user to specify the series of steps necessary to con gure the environment and
run a program. It is currently restricted to Linux applications, but as of the time
of this writing Microsoft is said to be working on Windows capability. Figure 3
shows a Docker description le along with a snippet of a script describing the
con guration and commands to execute.</p>
      <p>Based on these computational environment preservation tools, we developed
the following set of data items relevant to the de nition of a computational
environment. It is important to note that not every item need be de ned in all
cases. In fact, if a particular item is not relevant to the successful duplication
of a computational analysis, it might be preferable to omit it rather than overly
constraining those attempting to replicate the work.
Based on the analysis described in the previous section, we developed the
Computational Environment ODP shown graphically in Figure 4. Classes are depicted
as yellow rectangles with solid borders, and properties are shown as labeled
arrows from domain to range. Literal datatypes are shown as rounded blue
rectangles. Subclass relationships are depicted as white-headed arrows. The purple
rectangles with dashed borders indicate controlled vocabularies, which are
connected to their corresponding class by a dashed line. These will be discussed in
more detail in the following section.
3 Examples of instances of `location' include mount point, URL, etc.
4 We did not nd a need to treat libraries or device drivers di erently from standalone
software.</p>
      <p>The axiomization of this pattern is relatively straightforward. The controlled
vocabularies (i.e. list of acceptable values for a particular type of entity) are
represented as named instances of the appropriate class, with an accompanying
OWL oneOf axiom to restrict the associated property to one of the allowable
instances, as shown below for the case of ProcessorArchitecture.</p>
      <p>&lt;ClassAssertion&gt;
&lt;Class IRI="#CPUArchitecture"/&gt;
&lt;NamedIndividual IRI="#AMD64"/&gt;
&lt;/ClassAssertion&gt;
&lt;ClassAssertion&gt;
&lt;Class IRI="#CPUArchitecture"/&gt;
&lt;NamedIndividual IRI="#PowerPC"/&gt;
&lt;/ClassAssertion&gt;
&lt;NamedIndividual IRI="#PowerPC"/&gt;</p>
      <p>...</p>
      <p>&lt;NamedIndividual IRI="#X86"/&gt;
&lt;/ObjectOneOf&gt;
&lt;/EquivalentClasses&gt;</p>
      <p>Both domain and range are speci ed for hasCores, hasDistributionVersion
and hasKernelVersion, but only the range is constrained for all other datatype
properties. Regarding object properties, we have scoped the domain and ranges
for hasCPUArchitecture, hasEnvironmentVariable, hasKernel and
hasDistribution, but scoped ranges only for hasSize, hasFrequency, and hasComponent, and
neither domain nor range for acquiredFrom.</p>
      <p>The listing below shows instance data for the pattern based on running
Wireshark on the rst author's laptop in order to analyze network tra c. IP addresses
have been obscured. While the ODP is commented to describe the classes and
properties, this example is included in order to more fully convey the meaning
and intended use of the entities within the pattern.
@base &lt;http://dase.cs.wright.edu/ontologies/ComputationalEnvironment#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
:exComputationalEnv a :ComputationalEnvironment .
:exMemory a :Memory ;</p>
      <p>:hasSize :exMemSize .
:exMemSize a :Amount ;
:hasNumericValue "16"^^xsd:double ;
:hasUnit "GB" .
:exDisk a :Disk ;</p>
      <p>:hasSize :exDiskSize .
:exDiskSize a :Amount ;
:hasNumericValue "500"^^xsd:double ;
:hasUnit "GB" .
:exProcessor a :Processor ;
:hasCPUType :Intel_Core_i5 ;
:hasCPUArchitecture :X86 ;
:hasCores "2"^^xsd:int ;
:hasFrequency :exFrequency .
:exFrequency a :Amount ;
:hasNumericValue "2.6"^^xsd:double ;
:exLocation :hasName "https://1.na.dl.wireshark.org/osx/</p>
      <p>Wireshark%202.2.7%20Intel%2064.dmg" .
:exShellEnv a :OS_Shell_Environment ;</p>
      <p>:hasEnvironmentVariable :exEnvVariable .
:exEnvVariable a :EnvironmentVariable ;
:hasName "SSH_CONNECTION" ;
:hasStringValue "xxx:xxx:xxx:xxx 60123 yyy:yyy:yyy:yyy" .
:exComputationalEnv :hasComponent :exMemory ,
:exDisk ,
:exProcessor ,
:exOperatingSystem ,
:exNetInterface ,
:exKeyboard ,
:exMonitor ,
:exSoftware ,
:exShellEnv .</p>
      <p>Our intention is that related ontologies, such as those used to capture a
computational analysis work ow, could refer to the computational environment ODP
presented here. Returning to the competency questions outlined at the beginning
of this paper, we see that this ODP speci es the underlying hardware, operating
system, and software that are fundamental to replicating a computational
analysis. Because the pattern includes the software involved in an analysis, critical
and/or frequently used software can be identi ed. Additionally, instance data in
this pattern could be queried in response to any errors discovered later in a piece
of software and any related analyses could be redone. Meanwhile, modeling the
hardware aspects of the environment allow one to determine when two studies
are directly comparable from a computational perspective.</p>
      <p>The pattern is intended to be exible enough to support modeling of a
computational environment under various conditions. For instance, sometimes it is
helpful for replicating the performance of parallelized algorithms if the single
core performance of the original system is known. This can be captured by
modeling the environment of a multi-core system with numCores equal to one (and
the rest of the instance data updated accordingly).</p>
      <p>The pattern is available on the Ontology Design Patterns website at http://
ontologydesignpatterns.org/wiki/Submissions:ComputationalEnvironment.
In addition, pattern development is hosted at Github5 where it is kept up to date
via the automated process described in the next section. Community comments
and contributions are always welcome.</p>
      <p>
        The software class in this model is roughly equivalent to the concept of a
software bundle in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and it would be possible to align the two models at this
point if more granular modeling of the software aspects of the computational
environment were desired.
6
      </p>
    </sec>
    <sec id="sec-4">
      <title>Maintenance</title>
      <p>Reproducing a computational analysis is a relatively exacting endeavor, and this
has some implications for the Computational Environment ODP. One of these
is that this ODP is more concrete than most, down to the level of controlled
vocabularies for some of the entities most critical for replicating certain types of
computational analyses, such as those that involve assessing computation time.
Of course, controlled vocabularies are a double-edged sword: they are useful for
constraining possible values, but they require regular maintenance to make sure
that they remain up to date. This is particularly true in the case of
computational environments, where the pace of technology evolution can quickly make
controlled vocabularies for entities such as processors and operating system
distributions obsolete.</p>
      <p>In order to leverage the bene ts of controlled vocabularies while mitigating
the maintenance e ort, we have written a script that automatically runs each
month to update the controlled vocabularies in the ODP. This is done by using
the data on various Wikipedia pages. Wikipedia is manually curated by
thousands of people on a consistent basis, so it is an ideal source of data in this
case. When the script runs, it accesses the Wikipedia pages shown in Table 1
and uses basic text processing to extract the list of processor architectures,
processors, operating system kernels, and operating system distributions. It then
compares these lists to those present in the current version of the ODP. If there
are new items in the list, the pattern is updated and automatically pushed to
the GitHub repository. In order to avoid invalidating any previously valid linked
data published according to this ODP, the script does not remove any values
5 https://github.com/mcheatham/computationalEnvironmentODP
from the controlled vocabularies in the pattern, even if they are removed from
Wikipedia.6</p>
      <p>We note that there is some di erence between the technical de nitions of the
terms 'kernel' and 'distribution' and the way they are often used in practice.
For example, a computer may be running Ubuntu, in which case the the kernel
is Linux, and the distribution is Ubuntu. In the case of Mac OS, the kernel
is technically XNU (a variant of Mach), the distribution is macOS, and the
distribution version may be something like 10.12 (Sierra). However, it is relatively
common to refer to macOS as a kernel rather than a distribution. The same issue
sometimes comes up with CPUs and CPU architectures, albeit less frequently.
In situations like these, we employ the same categorization as that used on
Wikipedia.</p>
      <p>A copy of the update script is stored in the same GitHub repository as the
ontology design pattern. The script automatically executes at 3:00 am EST on
the rst of each month.</p>
      <p>Entity Wikipedia Page(s)
CPU Architecture List of CPU architectures
CPU List of microprocessors
OS Kernel List of operating systems</p>
      <p>OS Distribution List of operating systems &amp; List of Linux distributions
There has been signi cant work on developing semantic models to enhance the
reproducibility of computational analyses; however, most existing models begin
at the level of software or services. In this work we model a computational
environment down to the \bare metal" of the computer on which an analysis is
performed. This is particularly important for reproducing results involving things
such as computation time (or that may timeout or cause memory thrashing).
The pattern makes extensive use of controlled vocabularies in order to achieve a
high level of comparability between instances, while using automated scripts to
extract data from Wikipedia in order to limit the maintenance e ort required
to keep these controlled vocabularies current.</p>
      <p>We plan to utilize this model to capture the details of computational
environments mentioned in academic research papers, as well as those inherent in
virtual machine description les. Our eventual goal is to develop an application
6 In the case of misspellings, mistakes, or malicious Wikipedia edits, values can be
removed manually.
capable of searching for an appropriate virtual machine with which to replicate
the computational analysis described in a given paper.</p>
      <p>There are some limitations to this pattern that we hope to address in our
future work on this topic. The pattern currently does not model grid-based
computational environments. In addition, the pattern captures only the speci c
environment in which a computational analysis occurred, not any in which it
could be repeated without impacting the results.</p>
      <p>Acknowledgments. The rst, third and fourth author acknowledge partial
support by the National Science Foundation under award 1440202 EarthCube
Building Blocks: Collaborative Proposal: GeoLink { Leveraging Semantics and Linked
Data for Data Sharing and Discovery in the Geosciences. The rst, second and
fourth authors acknowledge partial support by the National Science Foundation
under award PHY-1247316 DASPOS: Data and Software Preservation for Open
Science</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo Verdejo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Missier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bechhofer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Garc a Cuesta,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Gomez-Perez</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.M.</surname>
          </string-name>
          , et al.:
          <article-title>Work owcentric research objects: First class citizens in scholarly discourse</article-title>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dappert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peyrard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delve</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Describing digital object environments in premis</article-title>
          .
          <source>In: 9th International Conference on Preservation of Digital Objects (iPRES2012)</source>
          . pp.
          <volume>69</volume>
          {
          <issue>76</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Roure</surname>
          </string-name>
          , D.C.
          <article-title>: myexperiment: social networking for work ow-using e-scientists</article-title>
          .
          <source>In: Proceedings of the 2nd workshop on Work ows in support of largescale science</source>
          . pp.
          <volume>1</volume>
          {
          <issue>2</issue>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mayer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antunes</surname>
          </string-name>
          , G.:
          <article-title>Preserving scienti c processes from design to publications</article-title>
          .
          <source>In: International Conference on Theory and Practice of Digital Libraries</source>
          . pp.
          <volume>113</volume>
          {
          <fpage>124</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Taelman</surname>
            , R., Van Herwegen,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
          </string-name>
          , R.:
          <article-title>Reproducible software experiments through semantic con guration</article-title>
          (May
          <year>2017</year>
          ), https:// linkedsoftwaredependencies.org/articles/reproducibility/, accessed:
          <fpage>2017</fpage>
          - 06-22
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>