<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Organic Data Publishing: A Novel Approach to Scientific Data Sharing</article-title>
      </title-group>
      <contrib-group>
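        <contrib contrib-type="author">
          <name>
            <surname>Hanson</surname>
            <given-names>Paul C.</given-names>
          </name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Gil</surname>
            <given-names>Yolanda</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Ratnakar</surname>
            <given-names>Varun</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Limnology, University of Wisconsin-Madison</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Sciences Institute and Department of Computer Science, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>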
      </contrib-group>
      <abstract>
        <p>Many scientists do not share their data due to the cost and lack of incentives of traditional approaches to data sharing. We present a new approach to data sharing that takes into account the cultural practices of science and offers a semantic framework that 1) links dataset contributions directly to science questions, 2) reduces the burden of data sharing by enabling any scientist to contribute metadata, and 3) tracks and exposes credit for all contributors. To illustrate our approach, we describe an initial prototype that is built as an extension of a semantic wiki, can import Linked Data, and can publish as Linked Data any new content created by users.</p>
      </abstract>
      <kwd-group>
        <kwd>Scientific data sharing</kwd>
        <kwd>provenance</kwd>
        <kwd>semantic wiki</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Although scientists in many disciplines (e.g., astronomy and physics) share data through catalogs so that others
can harvest those data for analysis and publications,
this paradigm has not worked well in ecology. Ecology is a field science, where many
scientists have their own data collection instruments and often curate datasets
themselves for a particular location for many years. Vast amounts of data, often called “dark data” [Heidorn
2008], are sitting on the local systems of many thousands of scientists. These datasets are often very specific to a locality or phenomenon, but they
are developed by the vast majority of scientists, known as “the long tail of science.”
Some report that less than 1% of data in ecology are available once they are analyzed
and results are published [Reichman et al 2011]. Although scientists would like to
share data, they often do not do so for four fundamental reasons [Science 2011]:</p>
      <list list-type="order">
        <list-item>
          <p>the paradigm makes data providers second-class citizens, and for some
ecologists, data are a primary asset;</p>
        </list-item>
        <list-item>
          <p>data in ecology are complex, highly distributed, and typically obtained to
answer local questions, and posting those data in ways that make them
discoverable and accessible requires a lot of work;</p>
        </list-item>
        <list-item>
          <p>the probability of posted data being discovered independently of the science
social network is low, which greatly reduces the motivation to post the data;</p>
        </list-item>
        <list-item>
          <p>almost all data sharing today begins with scientific collaboration, and
traditional data sharing approaches are not linked to collaborative activities.</p>
        </list-item>
      </list>
      <p>Many current projects in geosciences depend on having broad access to data from
the long tail of science. Many observatory networks and initiatives such as
EarthCube<sup>1</sup> envision the geoscience community coming together to address ecosystem, regional,
continental, and global-scale problems. For example, understanding the carbon cycle
in water involves integrating data and analyses by scientists studying river, lake,
ocean, and coastal ecosystems. Critical research in ecology and geosciences can only
be addressed through the integration of data and models from thousands of scientists
spanning many disciplines (ocean, earth, and atmospheric sciences).</p>
      <p>These projects need the data to be shared, and shared in a way that supports
ad-hoc collaborations. The data need to be openly available and
well annotated with metadata so that they can be aggregated and integrated. We need
approaches that break down the artificial walls created by traditional discipline-specific
data catalogs and infrastructure projects.</p>
      <p>We are investigating organic data sharing as a novel approach to data publishing
that is open to all scientists to contribute in many forms, requires minimal effort from
contributors, collects and exposes credit for all contributions, and has emergent
organization. Our work builds on three interrelated techniques: semantic web standards,
linked web of data principles, and popular web paradigms for interfaces such as
semantic wikis to annotate and aggregate data.</p>
      <p>This paper describes our initial work towards this vision. We begin with an
overview of the approach, followed by a walkthrough of a prototype that we have
developed to illustrate it. We also present a visionary scenario that shows how this
approach would open science to a broader set of contributors.</p>
      <fn-group>
        <fn id="fn1">
          <label>1</label>
          <p>http://www.nsf.gov/geo/earthcube/</p>
        </fn>
      </fn-group>
    </sec>
    <sec id="sec-2">
      <title>Organic Data Sharing</title>
      <p>Organic data sharing builds on three interrelated techniques:</p>
      <list list-type="order">
        <list-item>
          <p>Semantic web standards for defining semantic metadata in an extensible way
over web standards, including the use of RDF to define data types and
properties, which allow users either to reuse properties already defined in the system
or to easily add and use new properties.</p>
        </list-item>
        <list-item>
          <p>Linked data principles to expose datasets and their semantic metadata in an
open form on the Web. Traditional data repositories upload data to a
central or distributed database, akin to a vault where the data are kept. In contrast,
linked data principles encourage all data and metadata to be web objects that
can be openly accessed by third-party web applications. Vast and
rapidly growing amounts of linked data are published in this format. They already
include many datasets relevant to ecology, such as geospatial data
(Geonames, OpenStreetMap), life sciences data (Gene Ontology, PDB),
academic publications (PubMed, ACM), and Wikipedia info boxes (DBPedia).</p>
        </list-item>
        <list-item>
          <p>Semantic wikis as popular web paradigms for interfaces and access to facilitate
the creation of simple tools of broad applicability to browse, visualize,
annotate, and integrate data. Semantic wikis augment traditional wikis so that the
hyperlinks between topic pages are annotated with a semantic relationship.
The contributors themselves can create the emergent structure of the content
by adding new properties on an as-needed basis.</p>
        </list-item>
      </list>
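      <p>As a rough illustration of the first two techniques, the following sketch stores metadata as subject-property-value statements, lets a contributor reuse an existing property or add a new one, and serializes the statements in N-Triples syntax so that they could be exposed on the Web. It is a simplified stand-in written in plain Python, not the implementation of our prototype; the namespace http://example.org/ and all dataset, lake, and property names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Illustrative only: a tiny in-memory statement store, not a real RDF library.
# The namespace and every name below are hypothetical.
BASE = "http://example.org/"

triples = set()

def add(subject, prop, value):
    """Assert one (subject, property, value) statement."""
    triples.add((subject, prop, value))

def properties_in_use():
    """Properties already defined in the system and available for reuse."""
    return sorted({p for (_, p, _) in triples})

def to_ntriples(statements):
    """Serialize statements as N-Triples lines (all terms treated as URIs)."""
    lines = []
    for s, p, o in sorted(statements):
        lines.append("<%s%s> <%s%s> <%s%s> ." % (BASE, s, BASE, p, BASE, o))
    return "\n".join(lines)

# One contributor introduces a property; another reuses it and adds a new one.
add("Dataset1", "locatedIn", "LakeA")
add("Dataset2", "locatedIn", "LakeB")       # reuse of an existing property
add("Dataset2", "hasInstrument", "Buoy7")   # new property added on the fly

print(properties_in_use())
print(to_ntriples(triples))
```
]]></preformat>
      <p>A real deployment would rely on an RDF store and dereferenceable URIs; the point of the sketch is only that new properties require no schema change and that every statement can be published as a web object.</p>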
      <p>Our approach is to design an environment that supports scientists in carrying out the
following activities:</p>
      <list list-type="bullet">
        <list-item>
          <p>any scientist can define collaborative tasks by stating questions that require
participation from the broader community</p>
        </list-item>
        <list-item>
          <p>any scientist can contribute to those tasks, decompose them into subtasks if
appropriate, and request particular kinds of datasets</p>
        </list-item>
        <list-item>
          <p>scientists can contribute datasets that they own simply by adding a pointer to
their datasets, which will continue to reside in their local systems and under
their control</p>
        </list-item>
        <list-item>
          <p>any scientist can add metadata to any dataset, defining new metadata
properties or adopting properties that others have used (or from common ontologies)</p>
        </list-item>
        <list-item>
          <p>any scientist can change the metadata specified for any dataset in order to
adopt the same properties that other similar datasets use, facilitating
aggregation of data</p>
        </list-item>
        <list-item>
          <p>any scientist can use any dataset, and must post the results of their analyses
with appropriate links to the original datasets that they used</p>
        </list-item>
      </list>
      <p>The system will support organic data sharing by:</p>
      <list list-type="bullet">
        <list-item>
          <p>assigning credit to each individual scientist by tracking, aggregating, and
exposing all their contributions of any nature</p>
        </list-item>
        <list-item>
          <p>pointing scientists towards tasks that could use their contributions by
analyzing the semantic properties available</p>
        </list-item>
        <list-item>
          <p>allowing users to import content that may be available as linked data</p>
        </list-item>
        <list-item>
          <p>publishing as linked data any content created by users</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>An Illustration of Organic Data Sharing</title>
      <p>This section illustrates organic data publishing through an initial prototype that
extends a semantic wiki framework. Semantic MediaWiki builds on the popular
MediaWiki software and extends it to allow users to express semantic relations<fn id="fn2"><label>2</label><p>http://semantic-mediawiki.org/</p></fn>. We
describe how the user interacts with the system in order to illustrate the capabilities of
the system.</p>
      <p>Figure 1 illustrates the variety of entities that can be linked to one another through
structured properties. In the figure, one window shows a wiki page for a dataset,
including semantic metadata properties that describe the collection instrument, location,
and time as well as the investigator who contributed it. That location happens to be a
lake, which is described in its own wiki page, shown in a separate window in the
image, with its own geospatial and other semantic metadata properties. A third
window shows the wiki page for the investigator, listing other contributed datasets and
other information that might provide context for the data. Anyone can edit the wiki,
add metadata properties, extend the metadata vocabulary, and so on. All the information
collected through the site is published as Linked Data.</p>
      <p>All the contributors to each topic page are acknowledged, and there is a clear link
to the scientist who contributed each original dataset.</p>
      <p>The system enables contributors to easily define structured semantic properties to
describe the contents of the wiki, and uses RDF as the semantic representation
standard. Each wiki page describes an object of interest (e.g., a dataset, a project) and has a
section of "Structured Properties", where contributors can specify properties and
values of the topic of the page. Any contributor can define new properties on the fly.
Any contributor can change an existing property to align it with one that is used
elsewhere, effectively normalizing the use of the property across pages and therefore
across objects. This results in an organic normalization of metadata properties for
datasets, which would typically result when datasets need to be aggregated for some
science purpose. Figure 2 shows an example of how the system creates content of
wiki pages dynamically through queries, in this case a query to show three properties
of lakes. Users browsing the site are immediately exposed to missing information
and can choose to contribute it. When the missing information stands in the way of
progress, they can be more motivated to add it.</p>
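      <p>The two mechanisms described above, aligning a property with the name used elsewhere and building page content through queries, can be sketched as follows. This is a minimal illustration under our own assumptions, not code from the prototype: pages are modeled as plain dictionaries, and the lake names, property names, and values are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical wiki pages: page name -> {structured property: value}.
pages = {
    "Lake A": {"max depth (m)": 25.3, "surface area (ha)": 3961},
    "Lake B": {"maximum depth": 22.6, "trophic status": "eutrophic"},
}

def rename_property(pages, old, new):
    """Align a property with the name used on other pages (normalization)."""
    for props in pages.values():
        if old in props:
            props[new] = props.pop(old)

def query(pages, wanted):
    """Build a dynamic table; None marks information that is still missing."""
    return {name: [props.get(p) for p in wanted] for name, props in pages.items()}

# A contributor notices two spellings of the same property and merges them.
rename_property(pages, "maximum depth", "max depth (m)")

# A query over three lake properties exposes the gaps on each page.
table = query(pages, ["max depth (m)", "surface area (ha)", "trophic status"])
for name, row in table.items():
    print(name, row)
```
]]></preformat>
      <p>The empty cells in the resulting table play the role of the missing information that users see when browsing, and a single rename is enough to make both lakes comparable on depth.</p>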
      <p>The framework has predefined categories of pages, each with its own
predefined areas. So far, we have defined five special categories: Question, Answer,
Data, Workflow, and ExecutedWorkflow.</p>
      <p>Figure 3 illustrates the special page category of Question. These are pages that
reflect a task or subtask. They have sub-questions that point to pages of category
Question as well. These sub-questions may lead to a request for a dataset, as is the case in the
example shown in the figure. Some workflows may be designed and later executed
once the desired datasets are collected. When the question is answered, users can
create a page of another category, Answer, that summarizes all the findings and
perhaps includes pointers to a publication. Like any other page, question pages can have
structured properties, and each is credited to its author.</p>
      <p>Figure 4 illustrates the special page category of Data. These pages represent a
dataset, which, as always, can have structured properties. Some properties, as is the case
here, may be imported by the system from assertions available as Linked Data. Some
sections of the page are created dynamically through queries, for example to show
what workflows use the dataset as input (shown in orange in the figure).</p>
      <p>Figure 5 shows an example of a page with the special category of Workflow. In
this case, the workflow was created using a separate workflow system, Wings<fn id="fn3"><label>3</label><p>See http://www.wings-workflows.org</p></fn>, that
publishes workflows as Linked Data using OPMW<fn id="fn4"><label>4</label><p>See http://www.opmw.org</p></fn>, an extension of the Open
Provenance Model [Garijo and Gil 2011]. The system imports the OPMW assertions and
shows the workflow in a wiki form. Again, anyone can add structured properties or
documentation to this page.</p>
      <p>The framework incorporates the following major extensions to the semantic wiki:</p>
      <list list-type="bullet">
        <list-item>
          <p>Contributions are driven towards answering global science questions, which is a
great incentive for the participation of scientists. Answering these questions will
be the overarching goal, which will require contributors to do a variety of
tasks such as decomposing the high-level questions into smaller tasks,
sharing datasets, describing data characteristics, preparing them, running models,
etc.</p>
        </list-item>
        <list-item>
          <p>Workflow technologies and provenance standards are embedded in the
framework to enable scientists to describe analytic processes that will
document new data products in terms of how they were obtained from raw data.
Workflows are imported into the framework from Linked Data, where they
are published by the workflow system that created them. Workflows and
their results could also be added manually by users, for example if the steps
are run by hand or through scripts.</p>
        </list-item>
        <list-item>
          <p>Credit is given explicitly in every page and for every contribution. Credit is
aggregated per question and per user. Wikis provide a natural infrastructure
to track contributions, but these are typically hidden in the history tab of each
wiki page. The contributor of a dataset can see what question it is
contributing to and in what form (through the workflows that are using it).</p>
        </list-item>
      </list>
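      <p>One simple way to aggregate credit per question and per user is sketched below. It assumes a flat contribution log in which every entry counts equally, which is our simplification for illustration; the user names, question identifiers, and contribution kinds are hypothetical, and the weighting of contributions remains an open design question.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# Hypothetical contribution log: (user, question, kind of contribution).
log = [
    ("alice", "Q1", "dataset"),
    ("bob", "Q1", "metadata"),
    ("alice", "Q1", "metadata"),
    ("bob", "Q2", "workflow"),
]

def credit_per_user(log):
    """Total number of contributions made by each user."""
    return Counter(user for user, _, _ in log)

def credit_per_question(log):
    """For each question, the number of contributions made by each user."""
    credit = {}
    for user, question, _ in log:
        credit.setdefault(question, Counter())[user] += 1
    return credit

print(credit_per_user(log))
print(credit_per_question(log)["Q1"].most_common())
```
]]></preformat>
      <p>The same log could be filtered by contribution kind to show, for example, who contributed datasets as opposed to metadata for a given question.</p>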
      <p>We continue to extend this prototype to exemplify the approach of organic data
sharing. We are working with the EarthCube community to identify additional
requirements from scientists. More research is needed regarding contributor credits and
data citations. We plan to explore different incentive and reward mechanisms that will
suit the contributors’ communities of practice. Another aspect we plan to investigate
is the viability of emerging semantics as the contributors normalize the attributes and
properties they use. We will analyze the drivers for convergence on semantic
properties, the practical reuse of community ontologies such as SWEET, and their effect on
productivity and data reuse.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Quantitative data can be collected by instrumenting the system. We can use
standard wiki data collection metrics used in studies of wiki user behaviors and content
growth (e.g., the number of edits per user). We can also use metrics particular to
semantic wikis (e.g., the number of structured properties defined).</p>
      <p>In addition to these more traditional wiki-style evaluations, we will be developing
science-relevant metrics such as the number of datasets collected and the number of
datasets aggregated through normalization of metadata properties. Another novel
aspect of the evaluation of the system revolves around task decomposition,
task contributions, and task accomplishment, which have not been addressed in prior
work on contributor involvement.</p>
      <p>We will need to explore alternative designs for the task-centered aspects of the
approach. Recent work on social creation of to-do lists offers an alternative approach to
creating and organizing subtasks [Kamar et al 2012]. Other successful examples for
enticing contributors to contribute to joint tasks have used common collaborative web
software [Rocca et al 2012]. Formative evaluations to compare these approaches
could be carried out to determine what works best for organic data sharing.</p>
      <p>We have identified four important dimensions of evaluation that are of interest:
participation, collaboration, convergence, and achievement of the community.</p>
      <p>Participation metrics can be used that are indicative of the involvement of users
from the community. We can create an estimate of the size of the community as the
total number of unique users who ever visit the site. The system can then collect the
total number of users who edit pages and contribute content to the site, the total
number of datasets contributed, and the total number of edits both collectively and per
user. Additionally, participation metrics can be collected regarding the structured
properties defined in the semantic wiki, including the number of semantic properties added
by each user and the number of semantic properties defined for each type of dataset.</p>
      <p>Collaboration metrics can indicate how users overlap in their activities as they
collaborate on specific topic pages in the wiki. Data can be collected regarding the number
of users who edit the same topic page, the number of links across topic pages, and the
number of users that contribute to a given stated task or subtask.</p>
      <p>Convergence metrics will expose how the community normalizes structured
properties as the metadata is added for the diverse datasets. These metrics can include the
number of common properties across datasets used in a given task or workflow,
the number of unique users that adopt each property, the number of deprecated semantic
properties that are replaced by new (more broadly used) ones, and the evolution of
semantic properties over time. In addition, the number of queries defined in wiki
pages to create dynamic content based on semantic properties would be an indicator
that the content is being aggregated across separate pages and contributors.</p>
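      <p>Two of the convergence metrics named above can be computed directly from a log of metadata annotations, as the following sketch suggests. The log format and all user, dataset, and property names are hypothetical, and the functions are only one possible reading of the metrics.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical annotation log: (user, dataset, property).
annotations = [
    ("alice", "D1", "depth"),
    ("alice", "D1", "location"),
    ("bob", "D2", "depth"),
    ("carol", "D2", "instrument"),
]

def common_properties(annotations, datasets):
    """Properties shared by every dataset used in a task or workflow."""
    per_dataset = [
        {p for (_, d, p) in annotations if d == ds} for ds in datasets
    ]
    return set.intersection(*per_dataset) if per_dataset else set()

def adopters(annotations):
    """Number of unique users that have used each property."""
    users = {}
    for user, _, prop in annotations:
        users.setdefault(prop, set()).add(user)
    return {prop: len(u) for prop, u in users.items()}

print(common_properties(annotations, ["D1", "D2"]))
print(adopters(annotations))
```
]]></preformat>
      <p>Tracking these two numbers over time would indicate whether the vocabulary is converging: the set of common properties should grow and adopters should concentrate on fewer, more widely used properties.</p>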
      <p>Achievement measures the progress and accomplishments of task-oriented
contributions. The system can collect metrics regarding the number of tasks and subtasks
created, the number of data collection and workflow pages created in association with
tasks, the amount of user activity associated with each task and with wiki pages over
time, and the number of subtasks with answers as indicators of accomplishment.</p>
      <p>We plan to extend the system to take on a more proactive role in soliciting
contributions. The system could do meta-analyses on the content at any given point in time,
determine what is needed, and prompt users accordingly. For example, it could
determine what tasks have not advanced for some time, propose decomposing them into
smaller subtasks that define contributions more specifically, and identify who could
be approached to make a specific needed contribution based on their past history.</p>
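      <p>As a hedged sketch of the first step of such a meta-analysis, the function below flags tasks that have seen no activity for a given period. The 90-day threshold, the task names, and the dates are assumptions made for illustration only; the planned extension has not been designed at this level of detail.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from datetime import date, timedelta

# Hypothetical task records: task name -> date of last activity.
last_activity = {
    "Q1: carbon cycle in lakes": date(2012, 1, 10),
    "Q1.1: collect buoy data": date(2012, 6, 1),
}

def stalled_tasks(last_activity, today, max_idle_days=90):
    """Tasks with no recorded activity for more than max_idle_days."""
    limit = timedelta(days=max_idle_days)
    return sorted(t for t, d in last_activity.items() if today - d > limit)

# Tasks returned here are candidates for decomposition into smaller subtasks.
print(stalled_tasks(last_activity, today=date(2012, 6, 15)))
```
]]></preformat>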
      <p>Central to our approach is the tracking and exposure of credit to individual
contributors on a topic-by-topic as well as an individual basis. It is important for the
system to track contributions of any size and nature, ranging from contributions that
require significant effort (e.g., the contribution of a dataset that took months to
collect), to very small effort (e.g., the renaming of a property of a dataset to standardize
names across datasets), and any effort in between (e.g., the addition of a metadata
property to a dataset that required analyzing the data to decide on the property value).
Another important aspect of the system is to reflect the credit for user contributions
whenever content is presented, whether it is overall user credit in a user page, or
ranked credits to all users for a given topic page. Ranking contributors in scoreboards
appears to be a great incentive in social computing systems, and we will explore this.</p>
      <p>For owners of datasets (the dark data from the long tail), the explicit links from the
data to the scientific problems they are used for will address the concern about recognition
of their contributions. These links will also allow owners
to verify that their data are used for appropriate goals and with
appropriate transformations to fit the models used in the analyses. A further benefit is
that their future data collection efforts will put them in a position to
re-run the analyses with the new data. Currently, they typically lack both the knowledge
of how to run models and access to the model code. These issues will be
addressed by the availability of the analyses in the system.</p>
      <p>In the end, the credit tracked and acknowledged in our system must be recognized
through the traditional forms of credit in science, such as scientific publications. The credit
tracking in our approach will have to be combined with social rules that set expectations
about how contributors are acknowledged in any resulting publications. An approach
taken in Polymath is that the author is named as “Polymath” and a pointer to the web
site is provided where all contributors are acknowledged in detail together with the
nature of their contributions. We will explore together with the scientists in the
community what would be appropriate acknowledgements in publications.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We presented organic data sharing as a novel approach to collecting dark data from
the long tail of science in a form that is enticing to scientists and that includes
metadata annotations that make the data most usable. There are many potential
benefits of the proposed approach: 1) the publication of data and metadata is virtually
instantaneous, and so is access to them; 2) each scientist is personally responsible and in charge
of the publication of their data; 3) scientists, students, citizens, and policy makers can
all be contributors; 4) data descriptions can be created in an ad-hoc manner, and
normalized and integrated on an as-needed basis; 5) everyone else benefits when someone
invests in describing, normalizing, or aggregating data; and 6) the immediate benefits
to each scientist should be enough to make data publishing and metadata creation
a pleasant habit rather than a chore.</p>
      <p>There is already a success story of scientific sharing that has these properties. The
Web has all these properties: instantaneous, personal, participatory, self-organizing,
empowering, and addictive. We are building on web infrastructure to foster the
creation of a web of data for environmental science.</p>
      <p>Acknowledgements. This research was supported in part by a grant from the
National Science Foundation through award number IIS-1117281.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          “
          <article-title>A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data</article-title>
          .” In
          <source>Proc. WORKS'11</source>
          , Seattle, WA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Heidorn</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          “
          <article-title>Shedding Light on the Dark Data in the Long Tail of Science</article-title>
          .”
          <source>Library Trends</source>
          , Vol.
          <volume>57</volume>
          , No.
          <issue>2</issue>
          , Fall
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kamar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hacker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Horvitz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          “
          <article-title>Combining Human and Machine Intelligence in Large-scale Crowdsourcing</article-title>
          .”
          <source>AAMAS 2012</source>
          , Valencia, Spain, June
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Reichman</surname>
            ,
            <given-names>O.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>M.P.</given-names>
            <surname>Schildhauer</surname>
          </string-name>
          . “
          <article-title>Challenges and Opportunities of Open Data in Ecology</article-title>
          .”
          <source>Science</source>
          , Vol.
          <volume>331</volume>
          , no.
          <issue>6018</issue>
          , pp.
          <fpage>703</fpage>
          -
          <lpage>705</lpage>
          , February
          <year>2011</year>
          , DOI: 10.1126/science.331.6018.692.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Rocca</surname>
            <given-names>RA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magoon</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            <given-names>DF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krahn</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tilroe</surname>
            <given-names>VO</given-names>
          </string-name>
          , et al. “
          <article-title>Discovery of Western European R1b1a2 Y Chromosome Variants in 1000 Genomes Project Data: An Online Community Approach</article-title>
          .”
          <source>PLoS ONE</source>
          <volume>7</volume>
          (
          <issue>7</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <source>Science</source>
          ,
          <year>2011</year>
          ,
          <article-title>Special Issue on Challenges and Opportunities</article-title>
          . Vol.
          <volume>331</volume>
          , no.
          <issue>6018</issue>
          , pp.
          <fpage>692</fpage>
          -
          <lpage>693</lpage>
          , February 2011, DOI: 10.1126/science.331.6018.692.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>