<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Semantic Web Technologies to Reproduce a Pharmacovigilance Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michiel Hildebrand</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rinke Hoekstra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacco van Ossenbruggen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centrum Wiskunde &amp; Informatica</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>VU University Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <fpage>15</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>We provide a detailed report of a reproduction study of a paper published in the International Journal of Medical Sciences (IJMS). We first use the PROV-O ontology to model our reconstruction of the computational workflow of the original experiment and to systematically explicate all information that is needed for a reproduction study. We then identify which part of the required information is published in the IJMS paper and which part is missing. We then discuss our reproduction of this workflow, following the original as closely as possible. Again, we use PROV-O to precisely define our version of the workflow, including our version of the information that was missing in the IJMS paper. Finally, we generalize from the specific case described in the original paper by providing a web service that allows mining for arbitrary drug-adverse event pairs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Reproducing scientific results is often more an art than a science. By describing
a concrete case study we show how we used PROV-O to systematically analyse
a paper from a different field, written by authors we do not personally know.
We attempt to reconstruct the provenance graph of the original experiment by
carefully studying the description of the method, the statistics and the results
provided either directly in the paper or in other sources that the paper refers to.
We formalized our reconstruction using the PROV-O ontology. The formalization
makes the dependencies between the intermediate steps explicit, which should
allow us to systematically investigate how the results presented in the paper
were computed. To reproduce the results we need to understand the input and
output behavior of the computations modeled by the prov:Activity nodes. The
properties of the input and output prov:Entity nodes can help to verify whether this
understanding is correct.</p>
      <p>
        The paper we selected is the Open Access article Adverse Event Profiles of
5-Fluorouracil and Capecitabine: Data Mining of the Public Version of the FDA
Adverse Event Reporting System, AERS, and Reproducibility of Clinical
Observations, published in the International Journal of Medical Sciences (IJMS) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
The paper describes a computational data mining study on public data and
appears to be a good candidate for a reproduction study. We use this paper as a
case study to provide insights into the problem of reproducing scientific results;
we do not aim to criticize this particular paper in any way.
      </p>
      <p>
        The topic of the paper is an example of pharmacovigilance, which is defined
by the World Health Organization as “the science and activities relating to the
detection, assessment, understanding and prevention of adverse effects or any
other medicine-related problem” (footnote 3). Computational studies play an important role
in pharmacovigilance to detect drug side effects. Such studies are an economical
way to generate hypotheses before performing costly clinical reviewing [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The
IJMS paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] follows a typical scenario in pharmacovigilance: the use of a
database with reports of adverse events (AE) to find disproportional correlations
between a drug and an adverse reaction. In this example, the Adverse Event
Reporting System of the US Food and Drug Administration (FAERS) is used to
compare adverse effects of drugs.
      </p>
      <p>While the FAERS database itself is publicly available, it is not trivial to
reproduce the results of the experiments that use this database. Results and
tools are described in scientific publications, but the tools and (intermediate) results
themselves are typically not available. Our case study demonstrates in detail what prevents
reproduction. From the observations of this study we derive initial requirements
to support studies of drug side effects that can be fully reproduced.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>This section gives a brief overview of related work on data publication, scientific
workflows and provenance.</p>
      <p>
        The requirement for reproducibility [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] has been a key motivator for an
increased interest in data sharing and publication, especially in fields that
traditionally deal with ever growing datasets, e.g. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Even though data sharing does
not always immediately benefit the individual researcher, the potential for the
scientific community is significant [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Funding agencies, keen on maximizing
impact and reducing fraud, are now actively requiring data sharing. For
example, both the US National Science Foundation and the EU now require data
management plans for all proposals they consider (footnote 4). Note that also in areas that
focus on human action, such as human-computer interaction, replication has
become part of the research agenda (footnote 5).
      </p>
      <p>
        However, as becomes clear in this paper as well, raw data publication (such as
FAERS) is in itself not sufficient for reproducible research. Data often needs to
be moulded and transformed to a new data model before it becomes suitable for
answering a particular research question. This data preparation step can take
between 60 and 80% of data-oriented research tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Workflow systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
provide mechanisms for reproducing scientific conclusions based on shared data.
      </p>
      <p>
        The benefit for individual researchers of publishing a workflow is that workflows
are executable procedures that can be run against various inputs. Workflows can
be shared and reused through social platforms such as myExperiment (footnote 6). Curated
workflow descriptions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] combined with original data can serve as self-contained
research objects [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Footnote 3: http://www.who.int/medicines/areas/quality_safety/safety_efficacy/pharmvigi/en/</p>
      <p>Footnote 4: See http://www.nsf.gov/bfa/dias/policy/dmp.jsp and http://europa.eu/rapid/press-release_SPEECH-13-236_en.htm</p>
      <p>Footnote 5: http://www.cs.nott.ac.uk/~mlw/replichi.php</p>
      <p>
        There are, however, two drawbacks to using a workflow system. Firstly,
workflow descriptions are inevitably tied to the system used, and thus constrained
to the types of operations supported by the system. Secondly, not all steps of
interest in a scientific research process are necessarily of a computational nature;
consider, e.g., the information conveyed through the reuse of texts in scientific
discourse. Though in its early stages, work on automatic provenance
reconstruction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a promising approach to making explicit the temporal and causal
dependencies between individual elements of scientific output.
      </p>
      <p>
        The overarching requirement for reproducible research is an explicit account
of what processes and activities led from the original input, be it data, texts or other
media, to the contribution of a scientific publication. The PROV standard of
the W3C [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], based on ten earlier provenance models, such as the Open
Provenance Model (footnote 7) and the Provenance Vocabulary (footnote 8), provides a standard vocabulary
and semantics for expressing plans (workflows), process execution, dependencies
between entities and processes, and agent involvement. The PROV-O ontology
is a vocabulary for expressing PROV as Linked Data (footnote 9). Most scientific workflow
systems allow provenance tracking of workflow execution, and allow exporting
it to PROV or a compatible format. The consumption of provenance
information by applications is gradually receiving more attention [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The ProvBench
repository (footnote 10) has the objective to bootstrap the development of systems for the
visualization, analysis and understanding of provenance graphs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Basic concepts in Pharmacovigilance</title>
      <p>Various organizations maintain reporting systems of adverse events. The World
Health Organization (WHO) maintains vigiBase, the US Food and Drug
Administration (FDA) maintains the Adverse Event Reporting System (FAERS) and
many countries maintain their own system. These organisations provide
functionality for medical professionals to submit reports of adverse events that they
encountered with their patients. A report in the FAERS database contains a
list of the medication that the patient received and a list of adverse events. In
addition it may contain information about the patient, such as gender and
age. Unique to the FAERS database is that it is publicly available on the
Web. XML and CSV files for every yearly quarter starting at 2004 are available
for download.</p>
      <p>Footnote 6: See http://www.myexperiment.org</p>
      <p>Footnote 7: See http://purl.org/net/opmv/ns</p>
      <p>Footnote 8: See http://purl.org/net/provenance/ns</p>
      <p>Footnote 9: See http://www.w3.org/TR/prov-o/</p>
      <p>Footnote 10: https://sites.google.com/site/provbench/</p>
      <p>Adverse event databases are used in pharmacovigilance research to detect
side effects of drugs. An important part of this research focuses on the
detection of side effects of new drugs that appear on the market. The WHO has an
extensive program for this research (footnote 3), which involves large-scale data mining of
adverse event databases. Other research focuses on the side effects of sets of
specific drugs. These studies are typically motivated by clinical evidence.</p>
      <p>
        Both types of research depend on methods to detect a disproportional
correlation between a drug and an associated adverse event. The most common methods
are the proportional reporting ratio (PRR), the reporting odds ratio (ROR), the
information component (IC) and the empirical Bayes geometric mean (EBGM).
All are based on the expected frequency relative to all drug-event pairs that are
available in the database. Calculating signals with these methods requires a 2x2
contingency table, as shown in Table 1. This table contains (a) the number of
mentions of a drug together with a mention of a reaction (an adverse event), (b)
the number of mentions of all other drugs and that reaction, (c) the number of
mentions of the drug and all other reactions and (d) the number of mentions of
all other drugs and all other reactions. According to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the PRR is calculated
from this table using Eq. 1.
      </p>
      <p>\[ \mathrm{PRR} = \frac{a/(a+c)}{b/(b+d)} \tag{1} \]</p>
      <p>The expected value for a PRR is one and values above it indicate the strength
of the association. In addition, the strength of a statistical association can be
calculated using a standard chi-squared test (Eq. 2).</p>
      <p>\[ \chi^2 = \frac{(ad - bc)^2\,(a+b+c+d)}{(a+b)(c+d)(b+d)(a+c)} \tag{2} \]</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a signal is detected between a drug and an adverse event if
the PRR is at least 2, the chi-squared is at least 4 and there are at least 3
cases mentioning the drug and the event. We refer to the literature for the
details of ROR [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], IC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and EBGM [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. A comparison of these methods is
reported in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
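      <p>The PRR of Eq. 1, the chi-squared statistic of Eq. 2 and the three signal thresholds can be sketched in a few lines of Python. This is a minimal illustration under the cell definitions (a, b, c, d) given above; the function names and the example counts are made up for this sketch and do not come from the cited literature.</p>
      <preformat>
```python
def prr(a, b, c, d):
    """Proportional reporting ratio as in Eq. 1: [a/(a+c)] / [b/(b+d)]."""
    return (a / (a + c)) / (b / (b + d))


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 table as in Eq. 2 (no continuity correction)."""
    n = a + b + c + d
    numerator = (a * d - b * c) ** 2 * n
    denominator = (a + b) * (c + d) * (b + d) * (a + c)
    return numerator / denominator


def is_signal(a, b, c, d, min_prr=2.0, min_chi2=4.0, min_cases=3):
    """Apply the three thresholds quoted in the text to one drug-event pair."""
    return (
        a >= min_cases
        and prr(a, b, c, d) >= min_prr
        and chi_squared(a, b, c, d) >= min_chi2
    )


# Made-up counts, for illustration only.
print(prr(10, 20, 90, 880))
print(chi_squared(10, 20, 90, 880))
print(is_signal(10, 20, 90, 880))
```
      </preformat>
      <p>Keeping the thresholds as keyword arguments makes it easy to try other cut-offs without touching the formulas.</p>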
      <p>To compute the 2x2 contingency table one needs to collect all mentions of
a particular drug and a particular adverse event. Collecting the adverse event
mentions is straightforward because in the FAERS database they are consistently
identified with the preferred terms from the Medical Dictionary for Regulatory
Activities (MedDRA, footnote 11). The drug names in the FAERS database are, however,
not standardized. The same drug may be entered into the database in various
forms. For example, drug names are entered with or without dosage information,
method of administration (e.g. oral, injection) and other additions. Some reporters entered the drug
name, while others used the brand or trade name and yet others the active
ingredient. There are various spelling variations and synonyms. To properly fill
the 2x2 contingency table one has to deal with these variations in drug names.</p>
      <p>Footnote 11: http://www.meddramsso.com/</p>
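      <sec id="sec-3-1">
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>The 2x2 contingency table for a drug and an adverse reaction.</p>
          </caption>
          <table>
            <thead>
              <tr><th /><th>Drug of interest</th><th>All other drugs</th><th>Total</th></tr>
            </thead>
            <tbody>
              <tr><th>Reaction of interest</th><td>a</td><td>b</td><td>a+b</td></tr>
              <tr><th>All other reactions</th><td>c</td><td>d</td><td>c+d</td></tr>
              <tr><th>Total</th><td>a+c</td><td>b+d</td><td>a+b+c+d</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <p>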
Our target IJMS paper [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] investigates the so-called safety profiles of two types
of drugs that are used in the treatment of cancer. The first drug is 5-Fluorouracil
(5-FU), which was traditionally used for the treatment of solid tumors. This drug
was given by injection or infusion. Due to the high risks and costs of this type
of treatment the pharmaceutical industry developed a class of drugs known as
oral fluoropyrimidines, of which Capecitabine is the best known.
Clinical trials that compared the use of Capecitabine against 5-FU favor the use
of Capecitabine. Due to limitations of the clinical trials the picture is, however, not
complete. For example, the trials do not provide evidence for adverse events that
occur at relatively low frequencies. The aim of the paper is to test the conclusions
drawn from the trials and provide additional evidence for lower frequency adverse
events.
        </p>
        <p>In the IJMS paper the authors describe the method to detect the signals
for Capecitabine and 5-FU with various adverse events. As a first step towards
reproduction of this study we formalized the steps and their dependencies using
the PROV-O ontology. In addition, we describe the information provided in the
paper that could help in the reproduction.</p>
        <p>Original FDA datasources. The workflow starts at the bottom with datasources
obtained from the FDA. The website of the FDA contains ZIP files for each
yearly quarter (footnote 12). For each quarter there are two versions available, ASCII and
SGML. The former contains a dump of the database in the form of 7 CSV files,
while the latter contains a single SGML file. The authors of the IJMS paper used
the ASCII versions for the first quarter of 2004 up to the last quarter of 2009, a
total of 24 files. In the provenance graph the quarterly files are represented by
individual nodes, but for the sake of clarity we do not show all nodes. The ZIP
files from the FDA contain a document that describes the structure of the CSV
files and instructions on how to interpret them.</p>
        <p>Footnote 12: http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm083765.htm</p>
        <p>Report aggregation. The paper mentions that the total dataset contains 2,231,029
reports. From this we conclude that an aggregation step was performed. In the
provenance graph the aggregated dataset is represented by the node with the
label A.FAERS. This aggregated dataset is the starting point for two cleanup
activities. First, the authors removed superfluous reports, as the data contains
updated versions of a report as separate records. The paper refers to the
documentation from the FDA in which it is advised to keep only the most recent
report for a specific case. The resulting dataset is labeled B.FAERS in the
provenance graph, and contains (according to the paper) 1,644,220 reports.</p>
        <p>Drug name normalization. In the second cleanup step the drug names are
normalized: all drug names were unified into generic names by a text-mining
approach. The paper does not provide details of this text-mining approach. The
paper does explain that the cleanup includes the correction of spelling errors.
For this purpose GNU Aspell is used to detect spelling errors and the suggested
corrections are manually confirmed by working pharmacists. It is unclear how
many spelling corrections were made. Finally, foods, beverages, treatments (e.g.
X-ray radiation), and unspecified names (e.g. beta-blockers) were removed. It is
unclear from the paper whether this removal step is manual or automatic. The result
of the normalization activity is represented in the provenance graph by the node
C.FAERS.</p>
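        <p>The first cleanup step (keeping a single report per case) can be sketched as follows. The sketch assumes that the most recent report of a case is the one with the highest report identifier; that is an assumption of this illustration, since the text above does not state how recency was determined. The function name and the sample records are hypothetical.</p>
        <preformat>
```python
def keep_latest_reports(reports):
    """Keep one report per case: the one with the highest report identifier.

    `reports` is an iterable of (report_id, case_number) pairs.
    Assumption: a higher report identifier means a more recent report.
    """
    latest = {}
    for report_id, case_number in reports:
        if case_number not in latest or report_id > latest[case_number]:
            latest[case_number] = report_id
    return sorted(latest.values())


# Hypothetical records: case C1 was updated once, case C2 never.
reports = [(101, "C1"), (105, "C1"), (102, "C2")]
print(keep_latest_reports(reports))
```
        </preformat>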
        <p>Co-occurrence selection. The paper mentions that after the drug name
normalization the dataset contains 22,017,956 co-occurrences of drugs and events. A
drug name and an adverse event co-occur if they are mentioned together in a
report. The activity of counting co-occurrences is modeled as an explicit step
and the output is the node with label D.co-occurrences.</p>
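        <p>Counting co-occurrences as defined above can be sketched as follows. The sketch assumes each report is available as a pair of lists (drug names, adverse events) and counts a drug-event pair once per report that mentions both; the sample reports, including the event name nausea, are made up for illustration.</p>
        <preformat>
```python
from collections import Counter
from itertools import product


def count_cooccurrences(reports):
    """Count drug-event pairs; a pair is counted once per report that mentions both."""
    counts = Counter()
    for drugs, events in reports:
        for pair in product(set(drugs), set(events)):
            counts[pair] += 1
    return counts


# Made-up reports, for illustration only.
reports = [
    (["capecitabine"], ["leukopenia", "nausea"]),
    (["capecitabine", "fluorouracil"], ["leukopenia"]),
]
counts = count_cooccurrences(reports)
print(counts[("capecitabine", "leukopenia")])  # 2
print(sum(counts.values()))  # 4
```
        </preformat>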
        <p>Contingency table. To compute the PRR values from the set of co-occurrences a
2x2 contingency table is required for each drug-adverse event pair. Populating
the table requires the selection of the required subsets of co-occurrences. The
graph contains activities to create the tables for 5-FU with Leukopenia and
Capecitabine with Leukopenia. The resulting tables are shown as the nodes
5FU-Leukopenia and Capecitabine-Leukopenia. The IJMS paper does not
explicitly contain the 2x2 table for any of the drug-adverse event pairs, but
using the values mentioned in the paper we can partially reconstruct the tables,
see Table 2. In this table the values in bold font come from the paper; the italic
ones can be trivially calculated from these. The question marks represent values
which we will try to reverse engineer in the next section.</p>
        <p>Table 2. Partial 2x2 contingency tables for 5-FU and Capecitabine with Leukopenia. Recoverable values: the total number of co-occurrences with 5-FU (a + c) is 40,284 and with Capecitabine 34,928; eight cells are marked as unknown (?).</p>
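        <p>Selecting the four subsets of co-occurrences for one drug-event pair can be sketched as below. The sketch assumes the co-occurrences are available as a mapping from (drug, event) pairs to counts; the function name and the sample counts are hypothetical.</p>
        <preformat>
```python
def contingency_table(counts, drug, event):
    """Return the cells (a, b, c, d) of the 2x2 table for one drug-event pair.

    `counts` maps (drug, event) pairs to their number of co-occurrences.
    """
    a = b = c = d = 0
    for (other_drug, other_event), n in counts.items():
        same_drug = other_drug == drug
        same_event = other_event == event
        if same_drug and same_event:
            a += n  # drug of interest, reaction of interest
        elif same_event:
            b += n  # all other drugs, reaction of interest
        elif same_drug:
            c += n  # drug of interest, all other reactions
        else:
            d += n  # all other drugs, all other reactions
    return a, b, c, d


# Made-up counts, for illustration only.
counts = {
    ("5-FU", "leukopenia"): 3,
    ("other drug", "leukopenia"): 5,
    ("5-FU", "nausea"): 7,
    ("other drug", "nausea"): 11,
}
print(contingency_table(counts, "5-FU", "leukopenia"))  # (3, 5, 7, 11)
```
        </preformat>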
      </sec>
      <sec id="sec-3-3">
        <p>PRR values. The final PRR values and the results of the chi-squared test are
provided in the IJMS paper. In the provenance graph they are represented as
the end nodes, e.g. PRR 5FU-Leukopenia. Note that to recalculate the values for
the PRR and chi-squared tests we need to obtain the missing values in Table 2.</p>
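        <p>The reconstructed workflow can be written down as a small set of PROV-style statements. The following is a minimal sketch that uses plain Python tuples instead of an RDF library. The entity labels follow the text; the linear chaining of the steps and the activity labels report aggregation and PRR calculation are our reading for this illustration, not identifiers taken from the published graph.</p>
        <preformat>
```python
# (subject, predicate, object) statements; predicates follow PROV-O property names.
STATEMENTS = [
    ("A.FAERS", "wasGeneratedBy", "report aggregation"),
    ("report aggregation", "used", "AERS 2004Q1"),
    ("B.FAERS", "wasGeneratedBy", "duplicate report removal"),
    ("duplicate report removal", "used", "A.FAERS"),
    ("C.FAERS", "wasGeneratedBy", "drug name normalization"),
    ("drug name normalization", "used", "B.FAERS"),
    ("D.co-occurrences", "wasGeneratedBy", "co-occurrence selection"),
    ("co-occurrence selection", "used", "C.FAERS"),
    ("5FU-Leukopenia", "wasGeneratedBy", "2x2 table selection"),
    ("2x2 table selection", "used", "D.co-occurrences"),
    ("PRR 5FU-Leukopenia", "wasGeneratedBy", "PRR calculation"),
    ("PRR calculation", "used", "5FU-Leukopenia"),
]


def upstream_entities(entity, statements=STATEMENTS):
    """Return all entities that `entity` transitively depends on."""
    found = []
    todo = [entity]
    while todo:
        current = todo.pop()
        activities = [
            o for s, p, o in statements if s == current and p == "wasGeneratedBy"
        ]
        for activity in activities:
            for s, p, o in statements:
                if s == activity and p == "used" and o not in found:
                    found.append(o)
                    todo.append(o)
    return found


print(upstream_entities("PRR 5FU-Leukopenia"))
```
        </preformat>
        <p>Walking the statements backwards from an end node lists every intermediate dataset that has to be available before the final value can be recomputed.</p>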
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Reproduction experiment</title>
      <p>We first tried to recalculate the missing numbers in the 2x2 contingency tables
using the information given in the paper. Next we tried to reproduce the subsets
of drug-adverse event pairs that underlie the 2x2 contingency tables using the original
FAERS data from the FDA website, thus reproducing the entire workflow.
Further details of the reproduction are available at the website accompanying
this paper, http://www.few.vu.nl/~michielh/lisc2013/.</p>
      <sec id="sec-4-1">
        <title>Missing numbers and formulas</title>
        <p>Using the PRR values given in the paper and the PRR formula cited by the
paper, we should be able to reconstruct the missing values in the 2x2 contingency
tables. Note that while we do not know the values for b, the number of mentions of
an adverse event in co-occurrence with all other drugs, we do know the values
for (b + d). Based on Eq. 1, we should thus be able to calculate the values for b
by using Eq. 3.</p>
        <p>\[ b = (b+d)\,\frac{a/(a+c)}{\mathrm{PRR}} \tag{3} \]</p>
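        <p>For completeness, the rearrangement behind Eq. 3 can be spelled out in three steps, starting from Eq. 1 and using the same symbols:</p>
        <preformat>
```latex
\begin{gather*}
\mathrm{PRR} = \frac{a/(a+c)}{b/(b+d)} \\
\frac{b}{b+d} = \frac{a/(a+c)}{\mathrm{PRR}} \\
b = (b+d)\,\frac{a/(a+c)}{\mathrm{PRR}}
\end{gather*}
```
        </preformat>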
        <p>Knowing b, we should also be able to compute the total number of mentions
of an adverse event in the database, a + b.</p>
        <p>Table 3. Completed 2x2 contingency tables for 5-FU and Capecitabine with Leukopenia. 5-FU: a + c = 40,284, b = 28,585, d = 21,949,087. Capecitabine: a + c = 34,928, b = 28,747, d = 21,954,281.</p>
        <sec id="sec-4-1-1">
          <p>
            For example, using the PRR value for 5-FU (5.282) and the numbers from the partial contingency table, Table 2,
the total number of mentions of Leukopenia should be 28,887. Surprisingly, this
number is different when calculated from the PRR for Capecitabine (2.520),
namely 28,952. For the other adverse events mentioned in the paper we also
found a difference when calculated with the PRR of 5-FU or with the PRR of
Capecitabine. These differences are all bigger than can be explained by rounding
errors. After a more in-depth literature study we discovered that different formulas
are used to calculate the PRR. For example, the IJMS paper also cites [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] that
uses the formula given in Eq. 4:
          </p>
          <p>\[ \mathrm{PRR}_2 = \frac{a/(a+c)}{(a+b)/(a+b+c+d)} \tag{4} \]</p>
          <p>Unfortunately, we do not get a constant number for a + b with this formula
either. However, after some experimentation we discovered that with Eq. 5 we
achieve a constant number for the mentions of Leukopenia, 28,862. Also for the
other adverse events this formula results in a constant number. From this we
conclude that while Eq. 5 is neither given nor cited by the IJMS paper, it is most likely
the formula used to calculate all PRR values mentioned in the paper (!).</p>
          <p>\[ \mathrm{PRR}_3 = \frac{a/c}{(a+b)/(a+b+c+d)} \tag{5} \]</p>
          <p>Now that the total number of mentions of Leukopenia (a + b) is known, we can
complete the 2x2 contingency tables, see Table 3. Using this table it is also
possible to, modulo rounding errors, successfully reproduce the values from the
chi-squared tests with Eq. 2. Now that we know how to compute the basic statistics
reported by the paper, we can try to reproduce the entire experiment.</p>
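          <p>The consistency check behind Eq. 5 can be replayed numerically. The sketch below solves Eq. 5 for a + b, using only numbers stated in the text: the PRR values 5.282 and 2.520, the drug totals 40,284 and 34,928, and the grand total of 22,017,956 co-occurrences. The values a = 277 and a = 115 are the reported co-occurrences with Leukopenia (the text gives 289 as 12 more than reported for 5-FU, and 115 for Capecitabine). The function name is hypothetical.</p>
          <preformat>
```python
def reaction_total_from_prr3(a, drug_total, grand_total, prr3):
    """Solve Eq. 5 for a + b, given a, a + c, a + b + c + d and the reported PRR."""
    c = drug_total - a
    return (a / c) * grand_total / prr3


GRAND_TOTAL = 22_017_956  # co-occurrences after normalization, as reported

# (a, a + c, reported PRR) for each drug with Leukopenia
cases = {
    "5-FU": (277, 40_284, 5.282),
    "Capecitabine": (115, 34_928, 2.520),
}

for drug, (a, drug_total, prr3) in cases.items():
    print(drug, round(reaction_total_from_prr3(a, drug_total, GRAND_TOTAL, prr3)))
```
          </preformat>
          <p>Both drugs lead to the same rounded total for Leukopenia, which is the property that singles out Eq. 5.</p>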
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Workflow reproduction</title>
        <p>As it is unclear how the drug name normalization was performed, we decided not
to reproduce this on the entire dataset. We focus on the two drugs mentioned in
the IJMS paper: 5-FU and Capecitabine. Our aim is to approximate the PRR
values for these drugs and Leukopenia. The provenance graph of our
reproduction is available at http://www.few.vu.nl/~michielh/lisc2013/prov/. We
encourage the reader to access this graph. The prov:Entity nodes in this graph
are clickable and point to the underlying data. In this way we provide access to
the intermediate datasets, which is an essential ingredient for the successful
reproduction of computational workflows. Currently, we are investigating normalization
of all drug names in the FAERS dataset.</p>
        <p>Original FDA datasources. As in the study reported in the IJMS paper,
we downloaded the 24 quarterly dumps (from the beginning of 2004 to the end
of 2009) from the FDA website.</p>
        <p>Report aggregation by conversion to RDF. We chose to aggregate the quarterly
files into a single dataset by first converting them to RDF and then storing these
in a triple store. The total number of reports in our RDF dataset is 2,231,038,
which is 9 reports more than reported in the IJMS paper. It is unclear where
the difference comes from. We can, however, confirm that the conversion to
RDF did not alter the original reports, as the original CSV files combined also
contain 2,231,038 unique report identifiers (footnote 13). The conversion from CSV to RDF
was performed using SWI-Prolog and the RDF conversion toolset (footnote 14). Details of
the conversion, the resulting RDF and the SPARQL endpoint are available at
http://www.few.vu.nl/~michielh/lisc2013.</p>
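        <p>The check on unique report identifiers can also be sketched in Python. The sketch mirrors the shell one-liner given in the footnote: it takes the first $-separated field of every line and counts the distinct values. The function names and the latin-1 encoding are assumptions of this illustration, and, like the shell command, the sketch does not treat header lines specially.</p>
        <preformat>
```python
import glob


def unique_report_ids(lines):
    """Collect the distinct values of the first $-separated field."""
    return {line.rstrip("\n").split("$", 1)[0] for line in lines}


def count_unique_ids(pattern="DEMO0*.TXT"):
    """Count distinct first fields over all files matching `pattern`."""
    ids = set()
    for path in glob.glob(pattern):
        # latin-1 is an assumption; it accepts any byte sequence.
        with open(path, encoding="latin-1") as handle:
            ids |= unique_report_ids(handle)
    return len(ids)


# Made-up lines, for illustration only.
print(len(unique_report_ids(["1$a\n", "2$b\n", "1$c\n"])))  # 2
```
        </preformat>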
        <p>Duplicate removal. The duplicate removal step was performed on the RDF dataset.
We first grouped all reports with the same case number and for each group
selected the report with the highest report identifier. We removed the other reports
from the database. The resulting dataset contains 1,664,078 reports, which is 142
fewer than reported in the IJMS paper. We cannot explain this difference.</p>
        <p>Drug name normalization. Instead of normalizing all the drug names, we tried
to find all the mentions of our drugs of interest: 5-FU and Capecitabine. We
explored four methods to find different mentions of these drug names.</p>
        <p>1. We selected the mentions that contain the drug name itself. For Capecitabine
this returns many mentions of capecitabine, but also many variations such as
capecitabine tablet 1000 mg, capecitabine roche laboratories inc and capecitabine
2000 mg po as divided doses daily. In total we find 337 different mentions
containing Capecitabine.</p>
        <p>Footnote 13: The total number of unique report identifiers in the CSV files from the FDA is
computed with a unix bash script: cut -d$ -f1 DEMO0*.TXT | sort -u | wc -l</p>
        <p>Footnote 14: http://semanticweb.cs.vu.nl/xmlrdf/</p>
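        <p>The first selection method (mentions that contain the drug name itself) can be sketched as follows. The case-insensitive comparison and the function name are assumptions of this illustration; the mention strings are the examples quoted above, plus an upper-case variant and an unrelated drug name added to show what is and is not matched.</p>
        <preformat>
```python
def mentions_containing(drug_name, mentions):
    """Return the distinct mentions that contain `drug_name`, ignoring case."""
    needle = drug_name.lower()
    return sorted({m for m in mentions if needle in m.lower()})


mentions = [
    "capecitabine",
    "capecitabine tablet 1000 mg",
    "capecitabine roche laboratories inc",
    "capecitabine 2000 mg po as divided doses daily",
    "CAPECITABINE",  # made-up variant
    "aspirin",       # made-up non-matching mention
]
print(mentions_containing("Capecitabine", mentions))
```
        </preformat>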
        <p>Co-occurrence selection. Without drug name normalization our dataset contains
a total of 23,865,029 drug-adverse event co-occurrences, 1,847,073 more than
reported in the IJMS paper. This larger number of co-occurrences can be
explained by the fact that we did not remove foods, beverages, treatments (e.g.
X-ray radiation), and unspecified names (e.g. beta-blockers), as was mentioned
in the IJMS paper. In addition, a single report may contain
multiple treatments, each with a different drug mention. For example, a
report may contain a treatment with the mention capecitabine 500 MG and another
with the mention capecitabine 1000 MG. In other words, the patient received
two treatments, and in the second treatment the dosage of Capecitabine was
increased. Without drug name normalization these mentions are counted as two
co-occurrences, whereas after normalization they will be counted as a single
co-occurrence. Considering the formula for PRR in Eq. 5, this difference is reflected
in the denominator: in the total number of co-occurrences (a+b+c+d) as well as in
the total number of co-occurrences with a specific adverse event (a+b).</p>
        <p>Contingency table. Using the four methods to find drug mentions we selected
the set of co-occurrences corresponding to the cells of the 2x2 contingency table. The
total number of co-occurrences with Leukopenia (a+b) that we found is 30,724,
a difference of 1,862 with the number reported in the IJMS paper. This can
also be explained by the lack of drug name normalization. The total number
of co-occurrences with 5-FU is 42,115, which is 1,831 more than reported in the IJMS
paper. For Capecitabine 37,973 co-occurrences are found, 3,045 more than in
the IJMS paper. We conclude that the four drug name selection methods find
more mentions of the two drugs. Currently we are investigating if and why
drug mentions are falsely included. We found 289 co-occurrences for 5-FU with
Leukopenia. This is 12 more than reported in the IJMS paper. For Capecitabine
122 co-occurrences were found with Leukopenia, 7 more than the 115 reported
in the IJMS paper.</p>
        <p>Footnote 15: http://www.drugbank.ca/</p>
        <p>PRR values. Using the values in the reproduced 2x2 contingency tables and Eq. 5,
the PRR for 5-FU with Leukopenia is 5.367, compared to 5.282 in the IJMS paper. The chi-squared
test is 1019.763 compared to 952.334. For Capecitabine with Leukopenia the
PRR is 2.503 compared to 2.520 in the IJMS paper, and the chi-squared test is
109.661 compared to 103.730.</p>
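        <p>The reproduced statistics can be recomputed from the counts given in this section. The sketch below derives the four cells from the reproduced totals (289 and 122 co-occurrences with Leukopenia, 42,115 and 37,973 co-occurrences per drug, 30,724 co-occurrences with Leukopenia and 23,865,029 co-occurrences overall) and applies Eq. 5 and Eq. 2. The function names are hypothetical.</p>
        <preformat>
```python
def prr3(a, b, c, d):
    """PRR as in Eq. 5: (a/c) / ((a+b)/(a+b+c+d))."""
    return (a / c) / ((a + b) / (a + b + c + d))


def chi_squared(a, b, c, d):
    """Chi-squared statistic as in Eq. 2."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))


def cells(a, drug_total, reaction_total, grand_total):
    """Derive (a, b, c, d) from a and the three totals."""
    b = reaction_total - a
    c = drug_total - a
    d = grand_total - a - b - c
    return a, b, c, d


GRAND_TOTAL = 23_865_029   # co-occurrences without normalization
LEUKOPENIA_TOTAL = 30_724  # a + b

five_fu = cells(289, 42_115, LEUKOPENIA_TOTAL, GRAND_TOTAL)
capecitabine = cells(122, 37_973, LEUKOPENIA_TOTAL, GRAND_TOTAL)

print(round(prr3(*five_fu), 3), round(chi_squared(*five_fu), 3))
print(round(prr3(*capecitabine), 3), round(chi_squared(*capecitabine), 3))
```
        </preformat>
        <p>The printed values should land close to the reproduced PRR and chi-squared values quoted above, up to rounding in the last digit.</p>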
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Reproducing the study described in the IJMS paper required substantial
effort: it was difficult to verify the results of the intermediate datasets and
almost impossible to analyze the differences in the reproduction. All this is
despite the fact that the IJMS paper of the case study at first sight clearly
describes the method and results. By formalizing the computational workflow in
PROV-O it became possible to systematically investigate the intermediate steps.
We believe that sharing such provenance graphs is a first step in simplifying the
reproduction of computational workflows. The next step is to also make the
content of the prov:Entity nodes available, and ultimately the computational
processes that underlie the prov:Activity nodes. We hope that the clickable
provenance graph we made available at http://www.few.vu.nl/~michielh/lisc2013/prov/
serves as an example.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded under the Dutch COMMIT program as part of the
Data2Semantics and SEALINCMedia projects.</p>
      <p>[Figure: provenance graph of the original experiment. Entities: AERS 2004Q1, A. FAERS (2,231,029), C. FAERS, the 5FU-Leukopenia and Capecitabine-Leukopenia 2x2 tables, PRR 5FU-Leukopenia (5.282), chi-squared 5FU-Leukopenia (952.334) and PRR Capecitabine-Leukopenia (2.520). Activities: duplicate report removal, drug name normalization, co-occurrence selection and 2x2 table selection. Agent: pharmacists. Relations: used, wasGeneratedBy, wasAttributedTo and hadPrimarySource.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>H.</given-names>
            <surname>Akil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Martone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Van Essen</surname>
          </string-name>
          .
          <article-title>Challenges and opportunities in mining neuroscience data</article-title>
          .
          <source>Science</source>
          ,
          <volume>331</volume>
          (
          <issue>6018</issue>
          ):
          <fpage>708</fpage>
          -
          <lpage>712</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lindquist</surname>
          </string-name>
          , I. Edwards,
          <string-name>
            <given-names>S.</given-names>
            <surname>Olsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lansner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>De Freitas</surname>
          </string-name>
          .
          <article-title>A Bayesian neural network method for adverse drug reaction signal generation</article-title>
          .
          <source>European journal of clinical pharmacology</source>
          ,
          <volume>54</volume>
          (
          <issue>4</issue>
          ):
          <fpage>315</fpage>
          –
          <lpage>321</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Palma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garcia-Cuesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gómez-Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Klyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soiland-Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Verdes-Montenegro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>De Roure</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          .
          <article-title>Workflow-centric research objects: A first class citizen in the scholarly discourse</article-title>
          . In
          <source>Proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012)</source>
          , Heraklion, Greece, May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>E.</given-names>
            <surname>Deelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gannon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shields</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Workflows and e-science: An overview of workflow system features and capabilities</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <volume>25</volume>
          (
          <issue>5</issue>
          ):
          <fpage>528</fpage>
          –
          <lpage>540</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S. J. W.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Waller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports</article-title>
          .
          <source>Pharmacoepidemiology and Drug Safety</source>
          ,
          <volume>10</volume>
          (
          <issue>6</issue>
          ):
          <fpage>483</fpage>
          –
          <lpage>486</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Alper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          .
          <article-title>Common motifs in scientific workflows: An empirical analysis</article-title>
          . In
          <source>8th IEEE International Conference on eScience, USA</source>
          ,
          <year>2012</year>
          . IEEE Computer Society Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gould</surname>
          </string-name>
          .
          <article-title>Practical pharmacovigilance analysis strategies</article-title>
          .
          <source>Pharmacoepidemiology and Drug Safety</source>
          ,
          <volume>12</volume>
          (
          <issue>7</issue>
          ):
          <fpage>559</fpage>
          –
          <lpage>574</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wolstencroft</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lopez</surname>
          </string-name>
          .
          <article-title>Data curation + process curation=data integration + science</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>6</issue>
          ):
          <fpage>506</fpage>
          –
          <lpage>517</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Frew</surname>
          </string-name>
          .
          <source>Proceedings of the 4th international conference on provenance and annotation of data and processes</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Magliacane</surname>
          </string-name>
          .
          <article-title>Automatic metadata annotation through reconstructing provenance</article-title>
          .
          In
          <source>Third International Workshop on the role of Semantic Web in Provenance Management, ESWC 2012</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          .
          <article-title>PROV-Overview: An Overview of the PROV Family of Documents</article-title>
          . Working group note, W3C, Apr.
          <year>2013</year>
          . http://www.w3.org/TR/2013/NOTE-prov-overview-20130430/. Latest version available at http://www.w3.org/TR/prov-overview/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>K.</given-names>
            <surname>Kadoyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Miki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tamura</surname>
          </string-name>
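          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakaeda</surname>
          </string-name>
          , and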
          <string-name>
            <given-names>Y.</given-names>
            <surname>Okuno</surname>
          </string-name>
          .
          <article-title>Adverse event profiles of 5-fluorouracil and capecitabine: Data mining of the public version of the FDA Adverse Event Reporting System, AERS, and reproducibility of clinical observations</article-title>
          .
          <source>International Journal of Medical Sciences</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>33</fpage>
          –
          <lpage>39</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Matheny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Data mining methodologies for pharmacovigilance</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>35</fpage>
          –
          <lpage>42</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Mesirov</surname>
          </string-name>
          .
          <article-title>Accessible reproducible research</article-title>
          .
          <source>Science</source>
          ,
          <volume>327</volume>
          (
          <issue>5964</issue>
          ):
          <fpage>415</fpage>
          –
          <lpage>416</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Piwowar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Day</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Fridsma</surname>
          </string-name>
          .
          <article-title>Sharing detailed research data is associated with increased citation rate</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):e308, Jan.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>K.</given-names>
            <surname>Rothman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lanes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sacks</surname>
          </string-name>
          .
          <article-title>The reporting odds ratio and its advantages over the proportional reporting ratio</article-title>
          .
          <source>Pharmacoepidemiology and Drug Safety</source>
          ,
          <volume>13</volume>
          (
          <issue>8</issue>
          ):
          <fpage>519</fpage>
          –
          <lpage>523</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.</given-names>
            <surname>Szarfman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Machado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>O'Neill</surname>
          </string-name>
          .
          <article-title>Use of screening algorithms and computer systems to efficiently signal higher-than-expected combinations of drugs and events in the US FDA's spontaneous reports database</article-title>
          .
          <source>Drug Safety: An International Journal of Medical Toxicology and Drug Experience</source>
          ,
          <volume>25</volume>
          (
          <issue>6</issue>
          ):
          <fpage>381</fpage>
          –
          <lpage>392</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>E.</given-names>
            <surname>van Puijenbroek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Leufkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lindquist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Egberts</surname>
          </string-name>
          .
          <article-title>A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions</article-title>
          .
          <source>Pharmacoepidemiology and Drug Safety</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          –
          <lpage>10</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>