<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward A Robust Method for Understanding the Replicability of Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben Gelman</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chae Clark</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Scott Friedman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ugur Kuter</string-name>
          <email>ukuter@sift.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Gentile</string-name>
          <email>james.gentile@twosixlabs.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SIFT</institution>
          ,
          <addr-line>Minneapolis, MN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Two Six Labs</institution>
          ,
          <addr-line>Arlington, VA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <issue>2173</issue>
      <abstract>
        <p>The replicability of research is crucial for building trust in the peer review process and transitioning knowledge to real-world applications. While manual peer review excels in some regards, the variability of reviewer expertise, publication requirements, and research domains brings about uncertainty in the process. Replicability, in particular, is not necessarily a priority; this is evidenced by repeated failures in replication attempts such as the Psychology Reproducibility Project, where 61 of 100 replications fail. Improving human comprehension of decisive factors is crucial for integrating automated systems for replicability prediction into the review process. We develop a robust, automated method for semantic parsing, information extraction, and replication prediction that operates directly on PDFs. We introduce features that have not been explored in prior work, construct argument structures to guide understanding, and provide preliminary results for replication prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The replicability of research is crucial for building trust in
the peer review process and for the transition of knowledge
to real-world applications. Unfortunately, current attempts
at replicating research show that many research papers do
not replicate, with 61 of 100 failing the Psychology
Reproducibility Project
        <xref ref-type="bibr" rid="ref34">(Open Science Collaboration et al. 2015)</xref>
        ,
7 of 18 failing laboratory economics experiments
        <xref ref-type="bibr" rid="ref4">(Camerer
et al. 2016)</xref>
        , 3 of 13 failing the Many Labs Replication
Project
        <xref ref-type="bibr" rid="ref22">(Klein et al. 2014)</xref>
        , and more.
      </p>
      <p>Currently, research is manually peer reviewed by a few
experts often donating their time via venues such as
conferences and journals. While manual peer review excels in
some regards, the variability of reviewer expertise,
publication requirements, and research domains brings about
multiple levels of uncertainty. Additionally, peer review does not
specifically attempt to identify the replicability of research,
and, despite the growing number of automated analysis
tools and replication prediction systems, there have been few
changes to the review process over the years.</p>
      <p>
        Determining replicability at review time is challenging
for a multitude of reasons: limited access to data, limited
reviewer time, inability to run new experiments,
misleading statistics
        <xref ref-type="bibr" rid="ref19 ref34">(Head et al. 2015)</xref>
        , and the myriad variables
that affect a reviewer’s perception of the research, such as
the readability of the explanations, clarity and detail of the
methodology, significance of the authors’ claims, etc. These
variables that determine replicability can have varying levels
of impact on the decision to accept a paper due to reviewer
bias, research domain, and prior standards for acceptance.
Research is not always accepted because it is replicable.
Mapping these variables to actual replication outcomes can
produce a less biased estimation of replicability.
      </p>
      <p>In this work, we develop a novel method for
understanding replicability given only a PDF of the research while
encapsulating a wider, more robust set of factors than prior
art. Using a combination of rule-based processing and
machine learning, we perform consistent semantic parsing,
feature extraction, and replicability classification.</p>
      <p>Our main contributions are as follows:</p>
    </sec>
    <sec id="sec-2">
      <title>Consistent text extraction</title>
    </sec>
    <sec id="sec-3">
      <title>Automated classification of semantic flow</title>
    </sec>
    <sec id="sec-4">
      <title>Multifaceted feature extraction</title>
    </sec>
    <sec id="sec-5">
      <title>Preliminary replication prediction results</title>
      <sec id="sec-5-1">
        <title>Related Work</title>
        <p>The work related to our contributions is twofold:
previous literature has attempted to achieve a similar goal of
predicting replicability, and a variety of methods
that are relevant to our pipeline have not yet been used for
replicability prediction. We cover both aspects of that prior
work here.</p>
        <sec id="sec-5-1-1">
          <title>Predicting Replicability</title>
          <p>
            The replication crisis is repeatedly noted throughout
replicability literature
            <xref ref-type="bibr" rid="ref22 ref34 ref4">(Open Science Collaboration et al. 2015;
Klein et al. 2014; Camerer et al. 2016)</xref>
            . Because peer
review is currently an entirely manual process, a natural
consequence is the desire to automate the understanding of
replicability. An early attempt uses prediction markets to
determine a “market price” for research studies, representing the
likelihood that those studies would replicate
            <xref ref-type="bibr" rid="ref12 ref34">(Dreber et al.
2015)</xref>
            . The prediction markets correctly predict 29/41 (71%)
replications, but this method still requires approximately 50
domain experts to participate in the market. This is an
impractical requirement for situations such as peer-reviewed
conferences that often have three reviewers per paper. In
(Altmejd et al. 2019), the authors attempt to predict the
replicability of research by gathering features from within
or about the research itself: this involves statistical design
properties such as sample size, effect size, and p-value; or
descriptive aspects, such as the number of citations, number
of authors, and how subjects are compensated. By
aggregating a dataset of 131 direct replications, they achieve
approximately 70% prediction accuracy with random forest models.
Although the feature extraction is still a manual process,
locating relevant features in a paper is a tractable problem for
an individual, which is a substantial improvement over the
prediction markets.
            <xref ref-type="bibr" rid="ref44">Yang, Youyou, and Uzzi (2020)</xref>
            take
the automation a step further, obtaining 69%
prediction accuracy by training on word embeddings of the
research manuscript’s text. We automatically extract features
of prior work, generate a new set of features, and estimate
replicability with higher accuracy.
          </p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Natural Language Processing</title>
          <p>
            Whether attempting to obtain statistical test information or
to operate directly on text, natural language processing is
critical to the automation of manuscript featurization. A
crucial innovation in the realm of general purpose
natural language modeling is the use of models such as BERT
            <xref ref-type="bibr" rid="ref10">(Devlin et al. 2018)</xref>
            , which are pre-trained on large,
unsupervised corpora. Through a fine-tuning step, these models
transfer to new problems and domains. A particularly
relevant application is SciBERT
            <xref ref-type="bibr" rid="ref18 ref20 ref3">(Beltagy, Lo, and Cohan 2019)</xref>
            ,
which is pre-trained on scientific publications from
multiple domains. The authors show that this pre-training
significantly improves results on downstream tasks related to
scientific language. Recent work that focuses on scientific
articles leverages these models to identify entities
            <xref ref-type="bibr" rid="ref18 ref20">(Hakala
and Pyysalo 2019)</xref>
            , extract events and relationships
            <xref ref-type="bibr" rid="ref34 ref42">(Allen
et al. 2015; Valenzuela-Escárcega et al. 2018)</xref>
            , and relate
extracted events to domain models (Friedman et al. 2017).
Our work utilizes fine-tuning to create span-based
information extraction with a broader context that includes sample
sizes, experimental methodologies, excluded sample counts,
statistical tests, and more. Additionally, rather than
focusing on the findings and contributions of scientific articles,
we characterize methodologies, materials, confidence, and
replicability.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Approach</title>
        <p>
          We use a multi-stage pipeline to modularize each
component of the extraction and prediction process. Each
component can be swapped out as enhancements are developed,
such as improvements in PDF parsing, new rules in rule-based
methods, and updates to machine learning models. Figure 1
shows the flow of raw PDFs through the components, leading to
JSON output files that contain the article’s text and
associated features for use in downstream models. The
pipeline comprises several main components: PDF
extraction, semantic tagging, and information extraction.
        </p>
        <p>
          Extracting text from a PDF and formatting it into
informative segments are necessary steps for employing natural
language approaches. We use Automator
          <xref ref-type="bibr" rid="ref43">(Waldie 2009)</xref>
          to run
the built-in PDF to RTF extraction tool. The RTF files
maintain formatting information, and we use the command line
utility textutil to convert them to HTML files, which
we find to be more amenable to rule-based processing. We
apply rules to the extracted text because the extraction is an
error-prone process that fails around artifacts such as tables,
captions, and footnotes. Each HTML representation is parsed into a hash
map where the keys are content styles and the values are
all concatenated words and white-space of that style in the
order they appear. The main content string of the paper is
identified as the longest value, by character count, in this
hash map. The main content string is used for all subsequent
processing.
        </p>
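        <p>As an illustration of the style-keyed hash map described above, the following is a minimal sketch rather than the pipeline's actual code. It assumes that the HTML "class" attribute identifies a content style and that text belongs to its innermost enclosing element; the names StyleTextCollector and main_content are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import defaultdict
from html.parser import HTMLParser


class StyleTextCollector(HTMLParser):
    """Group text by the style of its innermost enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []                         # (tag, style) of open elements
        self.text_by_style = defaultdict(str)   # style -> concatenated text

    def handle_starttag(self, tag, attrs):
        # Assumption: the "class" attribute names the content style;
        # elements without a class fall back to their tag name.
        self.stack.append((tag, dict(attrs).get("class") or tag))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.text_by_style[self.stack[-1][1]] += data


def main_content(html):
    """Return the longest per-style string, by character count."""
    collector = StyleTextCollector()
    collector.feed(html)
    return max(collector.text_by_style.values(), key=len, default="")
```
]]></preformat>
        <p>In this sketch, short runs of text in rarely used styles (titles, captions, footnotes) end up under other keys and are therefore left out of the main content string.</p>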
        <sec id="sec-5-2-1">
          <title>Semantic Tagging</title>
          <p>
            A key element to understanding the structure of an argument
is the semantic context in which the argument is made. To
that end, we develop a machine learning model to annotate
paragraphs based on their content. This is similar to the
annotation work presented in (Chan et al. 2018),
            <xref ref-type="bibr" rid="ref18 ref20">(Huber and
Carenini 2019)</xref>
            , and
            <xref ref-type="bibr" rid="ref8">(Dasigi et al. 2017)</xref>
            . Here though, we
modify the annotation scheme to better match the problem
of information extraction for replication prediction. We
infer the discourse class for each sentence and average the
outputs to obtain the final class. This yields the
following modified annotation scheme with six elements:
          </p>
          <p>Introduction: Problem statement and paper structure.</p>
          <p>Methodology: Specifics of the study, including
participants, materials, and models.</p>
          <p>Results: Experimental results and statistical tests.</p>
          <p>Discussion: Authors’ interpretation of results and
implications of the findings.</p>
          <p>Research Practices: Conflicts of interest, funding sources,
and acknowledgements.</p>
          <p>Reference: Citations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Annotating Training Data</title>
      <p>To create a training
set for discourse class prediction, we extract text from 838
social and behavioral science (SBS) research articles. In
addition to the full text, these extractions contain the section
headers, which we use as annotations, resulting in
81,001 labeled sentences. Because section
header names vary with domain, tradition, and personal
preference, we assign a set of keywords to each discourse
class and label a section or segment of text if its section header
is grammatically close to a keyword. The keywords used to
create the dataset are:</p>
      <p>Introduction: {Introduction}</p>
      <p>Methodology: {Methodology, Analysis, Experiment,
Method, Procedure, Design, Material, Participant}</p>
      <p>Results: {Results}</p>
      <p>Discussion: {Discussion, Conclusion}</p>
      <p>Research Practices: {Acknowledgements, Funding,
Ethics Statement, Competing Interests, Ethical Approval}</p>
      <p>Reference: {Reference, Bibliography}</p>
      <p>
        Creating Semantic Vector Representations. Given a
sentence extracted from a research article, we use the
Universal Sentence Encoder model
        <xref ref-type="bibr" rid="ref5">(Cer et al. 2018)</xref>
        , which is
designed to embed words, sentences, and small paragraphs into
a semantically-related latent space. We represent each sentence
as a 512-dimensional vector that encodes its general
semantics and context.
      </p>
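      <p>The keyword-based labeling of section headers can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the dataset: the text does not specify how closeness to a keyword is measured, so the sketch substitutes a character-level similarity ratio from the Python standard library with a hypothetical threshold of 0.8. The function name label_header is also hypothetical.</p>
<preformat><![CDATA[
```python
import re
from difflib import SequenceMatcher

KEYWORDS = {
    "Introduction": ["Introduction"],
    "Methodology": ["Methodology", "Analysis", "Experiment", "Method",
                    "Procedure", "Design", "Material", "Participant"],
    "Results": ["Results"],
    "Discussion": ["Discussion", "Conclusion"],
    "Research Practices": ["Acknowledgements", "Funding", "Ethics Statement",
                           "Competing Interests", "Ethical Approval"],
    "Reference": ["Reference", "Bibliography"],
}


def label_header(header, threshold=0.8):
    """Return the discourse class whose keyword best matches the header."""
    # Drop section numbering and punctuation, and normalize case.
    words = re.sub(r"[^a-z ]", " ", header.lower()).split()
    best_class, best_score = None, 0.0
    for discourse_class, keywords in KEYWORDS.items():
        for keyword in keywords:
            n = len(keyword.split())
            # Compare the keyword against every n-word window of the header.
            for i in range(max(1, len(words) - n + 1)):
                window = " ".join(words[i:i + n])
                score = SequenceMatcher(None, keyword.lower(), window).ratio()
                if score > best_score:
                    best_class, best_score = discourse_class, score
    # Assumed cutoff: headers without a close keyword stay unlabeled.
    return best_class if best_score >= threshold else None
```
]]></preformat>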
      <p>We use that 512-dimensional vector as input to a
fully-connected hidden layer of size 512, followed by another
fully-connected hidden layer of size 256, followed by an output
layer of size 6 (one unit per discourse class). A
softmax activation after the output layer provides the discourse
prediction. We use 50% dropout between the layers and a
balanced sampling scheme to avoid overfitting to a single
class. We use precision and recall to evaluate the prediction
performance, shown in Table 2.</p>
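      <p>The step from sentence-level outputs to a single class can be sketched as below. This is a minimal illustration that assumes the averaging is taken over the softmax probabilities of the sentences in one text segment; the class order and the function names are hypothetical, and the scores would come from the classifier's output layer.</p>
<preformat><![CDATA[
```python
import math

CLASSES = ["Introduction", "Methodology", "Results",
           "Discussion", "Research Practices", "Reference"]


def softmax(logits):
    """Convert raw output-layer scores to probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_class(sentence_logits):
    """Average per-sentence class probabilities and return the top class."""
    probs = [softmax(logits) for logits in sentence_logits]
    mean = [sum(column) / len(probs) for column in zip(*probs)]
    return CLASSES[mean.index(max(mean))]
```
]]></preformat>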
      <p>
        The following is an example input whose actual section
header is: 2.5 Inference from
        <xref ref-type="bibr" rid="ref7">(Cohan et al. 2020)</xref>
        . Our
model predicts the title: Methodology.
      </p>
      <p>At inference time, the model receives one paper, P, and
it outputs the SPECTER’s Transformer pooled output
activation as the paper representation for P (Equation 1). We
note that for inference, SPECTER requires only the title and
abstract of the given input paper; the model does not need
any citation information about the input paper. This means
that SPECTER can produce embeddings even for new
papers that have yet to be cited, which is critical for
applications that target recent scientific paper</p>
      <p>[Table 2: precision and recall for the discourse classes Introduction, Methodology, Results, Discussion, Research Practices, and Reference.]</p>
    </sec>
    <sec id="sec-8">
      <p>In addition to the content and context extracted from the
article, we further include features unrelated to the structure
of the paper, but that are essential to the analysis of a
paper’s claims. These include both natural language features
and statistical test results.</p>
      <p>
        Language Quality. Regardless of the validity of a paper’s
methodology and analysis, a failure to adequately
communicate that information hinders others from using or
replicating that research. As a means of assessing the quality of the
writing itself, we compute three metrics over each paragraph
in the text: readability, subjectivity, and sentiment. The idea
that readability relates to the ability to reproduce
findings comes from the discussion in
        <xref ref-type="bibr" rid="ref36">(Plavén-Sigray et al.
2017)</xref>
        . We consider subjectivity due to discussions with
social and behavioral science domain experts about inferring
possible questionable research practices. Finally, positive or
negative sentiment in the results or discussion sections may
indicate biases towards the outcomes of the research.
Although any one of these features may not directly express
replicability, they do provide a holistic view of the writing.
      </p>
      <p>
        Using each paragraph as input, we compute readability
using Flesch Reading Ease
        <xref ref-type="bibr" rid="ref21">(Kincaid et al. 1975)</xref>
        ,
sentiment using the AllenNLP suite (Gardner et al. 2017), and
subjectivity using the TextBlob package
        <xref ref-type="bibr" rid="ref29">(Loria 2018)</xref>
        . This
produces a distribution of these features over the text that we
can relate to discourse, experimental results (statistics), and
domain-specific extractions.
      </p>
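      <p>For reference, the Flesch Reading Ease score can be computed from three counts per paragraph, as in the sketch below. The function name is hypothetical, and counting words, sentences, and syllables is left to the caller because the text does not state which tokenizer or syllable counter is used.</p>
<preformat><![CDATA[
```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))
```
]]></preformat>
      <p>For example, a paragraph of 100 words in 5 sentences with 150 syllables scores about 59.6 under this formula.</p>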
      <sec id="sec-8-1">
        <title>Methodological Information Extraction</title>
        <p>The unstructured prose of scientific documents includes key
features for assessing replicability, such as sample sizes,
populations, conditions, experimental variables, methods,
materials, exclusion criteria, and participant compensation.
Much of this information is available as concise spans of
text in the document: “twenty-four” may be a sample size;
“undergraduates” may be a population description;
“reaction time” may be a dependent variable; and so on.
Consequently, we are not interested in extracting and classifying
relations at this phase of analysis; rather, we optimize our
information extractor to classify individual spans within the
text with context-sensitive labels.</p>
        <p>Our dataset includes 620 labeled examples that are
annotated with the following properties:</p>
        <p>Sample count: How many elements are in the sample.</p>
        <p>Sample noun: Noun phrases referring to sample
elements, e.g., students, participants, cases, etc.</p>
        <p>Sample detail: Details of the sample, e.g., race, sex, age,
community, university, AMT, etc.</p>
        <p>Compensation: How participants are compensated.</p>
        <p>Exclusion count: Number excluded from sample.</p>
        <p>Exclusion reason: Stated reason(s) for why elements are
excluded from the final sample set.</p>
        <p>Experiment reference: Name or reference to an
experiment within the document.</p>
        <p>Experimental condition: Named or unnamed control or
experimental condition employed.</p>
        <p>Experimental variable/factor: Elements measured or reported
in the document, e.g., reaction time, participant
preference, accuracy on a task.</p>
        <p>Method or material: Experimental methods or materials
employed, e.g., ANOVA, questionnaire, priming.</p>
      </sec>
      <sec id="sec-8-2">
        <p>We extract these features using a transformer-based,
token-level classifier that processes each sentence
separately. The output of the classifier model is a
Begin/Inside/Outside (BIO) prediction for each token in a
sentence. This assumes that no labels overlap in the sentence,
which is one constraint of our dataset.</p>
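        <p>To make the BIO output concrete, the sketch below groups per-token tags into labeled spans. It is an illustration only: the label strings such as "sample_count" are hypothetical stand-ins for the properties listed above, and the tag sequence would come from the token-level classifier.</p>
<preformat><![CDATA[
```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) spans from per-token BIO tags."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_label
        if not continues and current_tokens:
            # The open span ends at any tag that does not continue it.
            spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # A stray "I-" tag without a matching "B-" starts a new span.
            current_label, current_tokens = label, [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans
```
]]></preformat>
        <p>Because each token carries exactly one tag, this decoding cannot represent overlapping labels, which matches the constraint noted above.</p>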
        <p>We illustrate the above labels as predicted on some
typical sentences from research articles in the SBS literature.
Figure 2 shows our model’s information extraction results
for a typical statement introducing a population and
sample size. This tags the English spans for sample count “one
hundred and ninety - seven,” sample noun “individuals,”
details (i.e., age mean, SD, gender, and AMT), an experiment
reference to “this study,” and the compensation of “$1.” In
another paper, Figure 3 identifies the number of sample
elements excluded, along with the resulting sample number and
gender details. Finally, Figure 4 shows a sentence from the
summary of an article, tagging “sem” (Structural Equation
Modeling) as a methodology, sample noun and details, and
five experimental factors that are assessed in the paper.</p>
        <p>Our model next processes the resulting classified spans –
as shown in Figures 2, 3, and 4 – to opportunistically
extract domain-specific numerical and Boolean features. For
example, the sample count and exclusion count are both
expected to be integers, so it attempts to coerce “one hundred
and ninety - seven” (Figure 2) and “Eight” (Figure 3) to
integers and populate corresponding integer features.
Similarly, the model uses a lexicon-based approach over the
sample descriptor spans to populate Boolean features
indicating whether participants’ genders, age, race, religion,
and community are specified, what the recruitment pool is
(e.g., AMT, universities, etc.), and how they are
compensated (e.g., course credit, monetary, etc.). These numerical,
Boolean, and lexical features populate the argument
structure of the paper, which we describe in subsequent sections.</p>
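        <p>The integer coercion step can be sketched as follows. This is a minimal illustration that handles digits and English number words up to the thousands and returns None when a span cannot be parsed; the function name coerce_count and the exact coverage are assumptions, not a description of the system's implementation.</p>
<preformat><![CDATA[
```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def coerce_count(span):
    """Return the integer named by a span, or None if it cannot be parsed."""
    text = span.lower().replace(",", "").replace("-", " ").strip()
    if text.isdecimal():
        return int(text)
    total = current = 0
    seen_number = False
    for word in text.split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None  # unknown token: leave the feature unpopulated
        seen_number = True
    return total + current if seen_number else None
```
]]></preformat>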
        <p>We train a model by fine-tuning SciBERT and DistilBERT
uncased models, and we evaluate using the same 558/62
randomized train/test split of our 620 labeled examples. Table 3
shows the results across four different transformer models
for 100 iterations each, showing best performance from the
SciBERT uncased model. While our model shows favorable
results for our relatively small dataset of 620 examples, we
are presently extending our dataset.</p>
        <p>One limitation of the present sentence-level analysis is
that cross-sentence coreferring expressions are unresolvable
within the model, although – since we are not extracting
complex relations across entities – most context-sensitive
concepts such as sample-size and exclusion-count have
ample context within the sentence itself. We plan to quantify
the benefit of adding cross-sentence coreference resolution
in future work.</p>
      </sec>
      <sec id="sec-8-3">
        <title>3.5 Statistical Test Extraction</title>
        <p>The descriptions of statistical tests in scientific documents
are much more structured than descriptions of samples,
methods, and factors. Consequently, our system uses Python
regular expressions (rather than a transformer-based model)
to extract statistical tests, motivated by processing speed and
tailorability. Our regular expressions identify 25 different
statistical tests and values, including p, R, R<sup>2</sup>, d, F-tests,
t-tests, mean, median, standard deviation, confidence
intervals, odds ratios, non-significance, and more. These regular
expressions were implemented for this system and were not
reused from a previous system.</p>
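        <p>As an illustration of regular-expression extraction, the sketch below covers only two of the reported patterns, F-tests and p-values. The patterns are our own simplified assumptions rather than the system's expressions; the record fields (df_val1, df_val2, val, ordinal) follow the node attributes shown in Figure 5.</p>
<preformat><![CDATA[
```python
import re

# A number such as "3.37", "27", or ".08".
NUM = r"[-+]?\d*\.?\d+"
F_TEST = re.compile(rf"F\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*({NUM})")
P_TEST = re.compile(rf"\bp\s*([<>=])\s*({NUM})")


def extract_tests(text):
    """Return sub-test records for F-tests and p-values, in reading order."""
    found = []
    for m in F_TEST.finditer(text):
        found.append((m.start(), {"name": "FTest", "df_val1": float(m[1]),
                                  "df_val2": float(m[2]), "val": float(m[3])}))
    for m in P_TEST.finditer(text):
        found.append((m.start(), {"name": "PTest", "ordinal": m[1],
                                  "val": float(m[2])}))
    # Sort by character offset so that neighboring sub-tests stay adjacent.
    return [record for _, record in sorted(found, key=lambda item: item[0])]
```
]]></preformat>
        <p>Keeping the character offsets during matching is what would allow a later step to group sub-tests that appear close together in the text.</p>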
        <p>Our statistical test extractor then clusters extracted
elements by proximity: Figure 5 shows two statistical tests
(F(1, 27) = 3.37 and p = .08) that correspond to the same
result. The system expresses each sub-test (e.g., F-test and
p-test) as a separate sub-test leaf of the overall statistical
test. Each sub-test describes distinctive features; for
example, the p-test includes a value and an ordinal feature with
the value “=” since the authors reported equality instead of
“&lt;” or “&gt;,” and the F-test includes two degrees of freedom
and a value. Clustering these statistical tests into subgraphs
helps identify duplicate reports of experimental results, and
it provides context for downstream graphical analysis and</p>
        <p>[Figure 5: the statistical test “F(1, 27) = 3.37, p = .08” as a StatTest node with two sub-test leaves linked by subtestOf: an FTest (df_val1 = 1.0, df_val2 = 27.0, val = 3.37) and a PTest (ordinal “=”, val = 0.08).]</p>
      </sec>
      <sec id="sec-8-4">
        <title>3.6 Assembly into Argument Structure</title>
        <p>After extracting individual spans and subgraphs from the
unstructured prose of a scientific article, we assemble the
extracted information into a global graph that we refer to as the
argument structure of the document. As implied by its name,
the argument structure is designed to express the premises,
evidence, and observations in a scientific article, ultimately
in support of its conclusions.</p>
        <p>The system generates the argument structure by iterating
over the sequence of text segments and associated
semantic tags (see Table 2 for a list of tags). Upon encountering
a transition in semantic tags, such as a new Methodology
section after a Discussion section, the system instantiates a
new Study node within its argument structure, and then adds
the BERT-extracted features (see above) and statistical test
subgraphs (see above) as constituents of the new node. In
this fashion, the system accumulates nodes for Introduction,
Study, and Discussion prose. A small set of features for two
Study nodes from the same paper is shown in Figure 6,
populated by information extraction.</p>
        <p>The graph-based layout of the argument structure allows
the system to assess independent replicability concerns in
a context-sensitive, explainable fashion. For example, as
shown in Figure 6, the sample size of 24 for the study node
at left may impact the judgment of that study’s replicability,
but it does not necessarily impact the replicability judgment
of the study at right, in the same paper. Likewise,
specifying the participants’ race in Figure 6 (left) may improve the
replicability judgment of that study but should not affect the
other study that does not specify participant race.</p>
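        <p>The assembly loop over tagged segments can be sketched as below. This is a simplified illustration, not the system's implementation: it assumes that a Methodology segment following any other tag opens a new Study node, that Results segments attach to the open Study node, and that features are plain dictionaries. All names are hypothetical.</p>
<preformat><![CDATA[
```python
def assemble_argument_structure(segments):
    """Build article-level nodes from a sequence of (tag, features) segments."""
    article = {"isa": "Article", "children": []}
    current, previous_tag = None, None
    for tag, features in segments:
        if tag in ("Methodology", "Results"):
            kind = "Study"
        elif tag in ("Introduction", "Discussion"):
            kind = tag
        else:
            previous_tag = tag
            continue  # e.g., Reference segments add no node in this sketch
        # Assumed transition rule: Methodology after another tag starts a study.
        starts_study = tag == "Methodology" and previous_tag != "Methodology"
        if current is None or current["isa"] != kind or starts_study:
            current = {"isa": kind, "features": {}}
            article["children"].append(current)
        current["features"].update(features)
        previous_tag = tag
    return article
```
]]></preformat>
        <p>Because each Study node keeps its own features, a small sample size recorded for one study does not leak into the node of another study from the same paper.</p>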
        <p>Each node in the directed argument structure graph is
connected directly or indirectly to the node representing the
scientific article itself. In this fashion, the argument structure
is a fully-connected graph that supports graph and pattern
matching, confidence propagation, and feature extraction in
order to judge and explain replicability.</p>
        <p>We train a random forest model to classify the
replicability of papers. Our dataset is a collection of papers from
the Journal of Experimental Psychology, and we are able
to mostly separate the replicable and non-replicable
experiments. We plan to improve that separation and the
calibration of the replicability scores in future work.</p>
        <p>Ground Truth Replications. To evaluate the ability of our
model to separate replicated studies from those that
did not replicate, we train and test on replication attempts
for papers from the Journal of Experimental
Psychology. We are able to collect approximately 150 PDFs of these
papers for parsing and processing. As the replication studies
are performed by different groups, there is variability in the
number of features available in the given data. Many contain
simple statistics such as sample size, but only a few contain
p-values. To expand the set of available features, we
manually mine them from the parsed PDFs. This gives features
related to the number and significance of p-values reported,
a proxy for the number of figures present, the presence of
effect size, and the presence of an appendix. Furthermore, we
judge replicability based on the ratio of known
replications to known failures (e.g., in a set of replication studies,
if an experiment was replicated 5 times and failed to
replicate 3 times, we say the experiment replicated).</p>
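        <p>The labeling rule in the example above can be written as a short function. This sketch assumes a simple majority rule and treats ties as not replicated; the text gives only the 5-versus-3 example, so both choices, and the function name, are assumptions.</p>
<preformat><![CDATA[
```python
def replication_label(n_replicated, n_failed):
    """Label an experiment as replicated when successes outnumber failures."""
    if n_replicated + n_failed == 0:
        return None  # no known replication attempts
    return n_replicated > n_failed
```
]]></preformat>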
        <p>Prediction Model &amp; Results. We train a binary random
forest classifier in a similar fashion to (Altmejd et al. 2019)
to predict the replicability of an experiment. We use 5000
estimators with a max depth of 3. We evaluate our
performance with AUC and accuracy, shown in Table 4. We select
11 psychology papers from the dataset and use these as the
evaluation set. We predict using experiment p-value and the
presence of effect size (binary). The results for the
individual papers are shown in Table 5.</p>
        <sec id="sec-8-4-1">
          <title>Discussion</title>
          <p>Targeting replicability in the evaluation of research is a
diverse task that is often not prioritized during peer review.
Improving human comprehension of decisive factors is a
crucial push towards integrating automated systems for
replicability prediction into the review process. In this work, we
develop an automated system for identifying, extracting, and
organizing those factors. We introduce measures of language
quality such as subjectivity, sentiment, and readability; we
semantically tag text in order to understand language
context; we extract statistical test information, linguistic
relationships, and methodologies; and then we construct a
hierarchical argument structure and perform replicability
classification. These factors and their organization are intuitive to
readers and allow for both top-down and bottom-up
understanding of a paper’s methods. Although leaving the review
process entirely up to automation is not feasible,
human-in-the-loop systems that guide reviewers through important
text, factors, and predictions can reduce the number of
non-replicable papers that make it through review.</p>
        </sec>
        <sec id="sec-8-4-2">
          <title>Future Work</title>
          <p>One of the main focuses of our future work is to extend
our ground truth datasets and evaluate replicability
prediction across the combinations of features that we develop in
the current work. Due to the limited data size, the evaluation
set is too small to definitively select the best combination of
features for replicability prediction.</p>
          <p>
            We are also working to broaden our system’s
features and capabilities. For instance, we are incorporating
a transformer-based information extractor that extracts the
causal, proportional, and comparative relationships in
scientific claims
            <xref ref-type="bibr" rid="ref30">(Magnusson and Friedman 2021)</xref>
            to relate the
claims within and across scientific documents in our
corpus. To improve human interpretation, we are working to
produce an explainability interface for users to inspect our
extractions, predictions, and argument structure for guided
paper understanding.
          </p>
          <p>
            We will further assess the validity of the elements of a
paper, i.e., the confidence that we have in the claims in the
paper given the assumptions made by it, by using an
existing probabilistic inference and recognition system, SUNNY,
originally developed for planning
            <xref ref-type="bibr" rid="ref25">(Kuter et al. 2004)</xref>
            and
social network analysis
            <xref ref-type="bibr" rid="ref23 ref24">(Kuter and Golbeck 2007, 2010)</xref>
            .
SUNNY propagates local estimates of uncertainty through
large models. Its most basic output is the probability, as a
function of time, that a particular event will be true. We have
extended SUNNY with k-nearest neighbors (kNN) learning
and prediction capabilities, as well as a Naive Bayes
diagnosis of confidence scores based on the kNN clustering.
The numeric and qualitative features in our argument
structures form the basis of the kNN clustering, and we will
extend these measures toward predicting replicability scores
in SUNNY in the near future.
          </p>
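          <p>The kNN idea described above can be sketched as follows. This is a minimal illustration of plain kNN voting over numeric features, not SUNNY's probabilistic inference and not the Naive Bayes step; the function name, the three feature dimensions, and all values are hypothetical.</p>
          <preformat>
```python
import math


def knn_replication_score(query, examples, k=3):
    """Return the fraction of the k nearest labeled papers that replicated.

    query    -- numeric feature vector for the paper being assessed
    examples -- list of (feature_vector, label) pairs, label 1 = replicated
    """
    # Rank labeled papers by Euclidean distance to the query vector.
    ranked = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical features: (readability, subjectivity, reported p-value).
labeled = [
    ((0.62, 0.20, 0.001), 1),
    ((0.58, 0.25, 0.004), 1),
    ((0.40, 0.55, 0.048), 0),
    ((0.35, 0.60, 0.045), 0),
    ((0.60, 0.30, 0.010), 1),
]

score = knn_replication_score((0.57, 0.28, 0.005), labeled, k=3)
print(score)
```
          </preformat>
          <p>With these illustrative values the three nearest neighbors all replicated, so the score is 1.0. Real features would need to be scaled to comparable ranges before distances are meaningful, and qualitative features would need a numeric encoding.</p>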
        </sec>
        <sec id="sec-8-4-3">
          <title>Acknowledgements</title>
          <p>This material is based upon work supported by the Defense
Advanced Research Projects Agency (DARPA) and Army
Research Office (ARO) under Contract No.
W911NF-20C-0002. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the
Defense Advanced Research Projects Agency (DARPA) and
Army Research Office (ARO).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2015.
          <article-title>Complex event extraction using DRUM</article-title>
          .
          <source>Technical report, Florida Institute for Human and Machine Cognition Pensacola United States.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2019.
          <article-title>Predicting the replicability of social science lab experiments</article-title>
          .
          <source>PloS one</source>
          <volume>14</volume>
          (
          <issue>12</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          .
          <source>arXiv preprint arXiv:1903.10676</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Camerer</surname>
            ,
            <given-names>C. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dreber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Forsell</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>T.-H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Johannesson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kirchler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Almenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Altmejd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; et al.
          <year>2016</year>
          .
          <article-title>Evaluating replicability of laboratory experiments in economics</article-title>
          .
          <source>Science</source>
          <volume>351</volume>
          (
          <issue>6280</issue>
          ):
          <fpage>1433</fpage>
          -
          <lpage>1436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kong</surname>
            ,
            <given-names>S.-y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Limtiaco</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>John</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Constant</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guajardo-Cespedes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; et al.
          <year>2018</year>
          .
          <article-title>Universal sentence encoder</article-title>
          .
          <source>arXiv preprint arXiv:1803.11175</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          2018.
          <article-title>Solvent: A mixed initiative system for finding analogies between research papers</article-title>
          .
          <source>Proceedings of the ACM on Human-Computer Interaction 2(CSCW)</source>
          :
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Feldman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Specter: Document-level representation learning using citation-informed transformers</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>2270</fpage>
          -
          <lpage>2282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Dasigi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>G. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>de Waard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>arXiv preprint arXiv:1702.05398</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.-W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Dreber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pfeiffer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Almenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Isaksson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nosek</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Johannesson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>Using prediction markets to estimate the reproducibility of scientific research</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>112</volume>
          (
          <issue>50</issue>
          ):
          <fpage>15343</fpage>
          -
          <lpage>15347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Fischer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Greitemeyer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Frey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Self-regulation and selective exposure: the impact of depleted self-regulation resources on confirmatory information processing</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>3</issue>
          ):
          <fpage>382</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2017.
          <article-title>Learning by reading: Extending and localizing against a model</article-title>
          .
          <source>Advances in Cognitive Systems</source>
          <volume>5</volume>
          :
          <fpage>77</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          2017.
          <article-title>AllenNLP: A Deep Semantic Natural Language Processing Platform</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Goff</surname>
            ,
            <given-names>P. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steele</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>The space between us: stereotype threat and distance in interracial contexts</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>1</issue>
          ):
          <fpage>91</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Hakala</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Biomedical named entity recognition with multilingual BERT</article-title>
          .
          <source>In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks</source>
          ,
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Head</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Holman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lanfear</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kahn</surname>
            ,
            <given-names>A. T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jennions</surname>
            ,
            <given-names>M. D.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>The extent and consequences of p-hacking in science</article-title>
          .
          <source>PLoS Biol</source>
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>e1002106</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Huber</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Carenini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Predicting discourse structure using distant supervision from sentiment</article-title>
          .
          <source>arXiv preprint arXiv:1910.14176</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Kincaid</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fishburne</surname>
            <suffix>Jr</suffix>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>R. L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chissom</surname>
            ,
            <given-names>B. S.</given-names>
          </string-name>
          <year>1975</year>
          .
          <article-title>Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel</article-title>
          .
          <source>Technical report, Naval Technical Training Command Millington TN Research Branch.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ratliff</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vianello</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Adams</surname>
            <suffix>Jr</suffix>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bahník</surname>
            ,
            <given-names>Š.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bocian</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brandt</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brumbaugh</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          ; et al.
          <year>2014</year>
          .
          <article-title>Investigating variation in replicability</article-title>
          .
          <source>Social Psychology</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Kuter</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Golbeck</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Sunny: A new algorithm for trust inference in social networks using probabilistic confidence models</article-title>
          .
          <source>In AAAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Kuter</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Golbeck</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Using Probabilistic Confidence Models for Trust Inference in Web-Based Social Networks</article-title>
          .
          <source>Transactions on Internet Technology (TOIT)</source>
          <volume>7</volume>
          :
          <fpage>1377</fpage>
          -
          <lpage>1382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Kuter</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gossink</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lemmer</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>Interactive Course-of-Action Planning Using Causal Models</article-title>
          .
          <source>In International Conference on Knowledge Systems for Coalition Operations (KSCO-2004)</source>
          ,
          <fpage>37</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Lemay</surname>
            <suffix>Jr</suffix>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <year>2008a</year>
          .
          <article-title>“Walking on eggshells”: how expressing relationship insecurities perpetuates them</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>95</volume>
          (
          <issue>2</issue>
          ):
          <fpage>420</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Lemay</surname>
            <suffix>Jr</suffix>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <year>2008b</year>
          .
          <article-title>How the head liberates the heart: Projection of communal responsiveness guides relationship promotion</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>4</issue>
          ):
          <fpage>647</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Loria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          . textblob Documentation.
          <source>Release 0.15 2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Magnusson</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Graph Knowledge Extraction of Causal, Comparative, Predictive, and Proportional Associations in Scientific Claims with a Transformer-Based Model</article-title>
          .
          <source>In AAAI Workshop on Scientific Document Understanding.</source>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Monin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sawyer</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Marquez</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>The rejection of moral rebels: resenting those who do the right thing</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>95</volume>
          (
          <issue>1</issue>
          ):
          <fpage>76</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Nosek</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Banaji</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Greenwald</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>Math = male, me = female, therefore math ≠ me</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>83</volume>
          (
          <issue>1</issue>
          ):
          <fpage>44</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Open Science Collaboration</surname>
          </string-name>
          ; et al.
          <year>2015</year>
          .
          <article-title>Estimating the reproducibility of psychological science</article-title>
          .
          <source>Science</source>
          <volume>349</volume>
          (
          <issue>6251</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Payne</surname>
            ,
            <given-names>B. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Burkley</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Stokes</surname>
            ,
            <given-names>M. B.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Why do implicit and explicit attitude tests diverge? The role of structural fit</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>1</issue>
          ):
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Plavén-Sigray</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Matheson</surname>
            ,
            <given-names>G. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schiffler</surname>
            ,
            <given-names>B. C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>W. H.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>The readability of scientific texts is decreasing over time</article-title>
          .
          <source>Elife</source>
          <volume>6</volume>
          :
          <fpage>e27725</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Purdie-Vaughns</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steele</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ditlmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Crosby</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Social identity contingencies: how diversity cues signal threat or safety for African Americans in mainstream institutions</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>4</issue>
          ):
          <fpage>615</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Shnabel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nadler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>A needs-based model of reconciliation: satisfying the differential emotional needs of victim and perpetrator as a key to promoting reconciliation.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>1</issue>
          ):
          <fpage>116</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Soto</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>John</surname>
            ,
            <given-names>O. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gosling</surname>
            ,
            <given-names>S. D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Potter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <article-title>The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          <volume>94</volume>
          (
          <issue>4</issue>
          ):
          <fpage>718</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Valenzuela-Escárcega</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babur</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hahn-Powell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Noriega-Atala</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Demir</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Morrison</surname>
            ,
            <given-names>C. T.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Large-scale automated machine reading discovers new cancer-driving mechanisms</article-title>
          .
          <source>Database</source>
          <volume>2018</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Waldie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Automator for Mac OS X 10.6 Snow Leopard: Visual QuickStart Guide</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Youyou</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Uzzi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Estimating the deep replicability of scientific findings using human and artificial intelligence</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>117</volume>
          (
          <issue>20</issue>
          ):
          <fpage>10762</fpage>
          -
          <lpage>10768</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>