Toward A Robust Method for Understanding the Replicability of Research

Ben Gelman,1 Chae Clark,1 Scott Friedman,2 Ugur Kuter,2 James Gentile1
1 Two Six Labs, Arlington, VA, USA
2 SIFT, Minneapolis, MN, USA
{ben.gelman, chae.clark, james.gentile}@twosixlabs.com, {friedman, ukuter}@sift.net

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The replicability of research is crucial for building trust in the peer review process and transitioning knowledge to real-world applications. While manual peer review excels in some regards, the variability of reviewer expertise, publication requirements, and research domains brings about uncertainty in the process. Replicability, in particular, is not necessarily a priority; this is evidenced by repeated failures in replication attempts such as the Psychology Reproducibility Project, where 61 of 100 replications fail. Improving human comprehension of decisive factors is crucial for integrating automated systems for replicability prediction into the review process. We develop a robust, automated method for semantic parsing, information extraction, and replication prediction that operates directly on PDFs. We introduce features that have not been explored in prior work, construct argument structures to guide understanding, and provide preliminary results for replication prediction.

1 Introduction

The replicability of research is crucial for building trust in the peer review process and for the transition of knowledge to real-world applications. Unfortunately, current attempts at replicating research show that many research papers do not replicate, with 61 of 100 failing in the Psychology Reproducibility Project (Open Science Collaboration et al. 2015), 7 of 18 failing among laboratory economics experiments (Camerer et al. 2016), 3 of 13 failing in the Many Labs Replication Project (Klein et al. 2014), and more.

Currently, research is manually peer reviewed by a few experts, often donating their time via venues such as conferences and journals. While manual peer review excels in some regards, the variability of reviewer expertise, publication requirements, and research domains brings about multiple levels of uncertainty. Additionally, peer review does not specifically attempt to identify the replicability of research, and, despite the increasing number of automated analysis tools and replication prediction systems, there have been few changes to the review process over the years.

Determining replicability at review time is challenging for a multitude of reasons: limited access to data, limited reviewer time, inability to run new experiments, misleading statistics (Head et al. 2015), and the myriad variables that affect a reviewer's perception of the research, such as the readability of the explanations, the clarity and detail of the methodology, the significance of the authors' claims, etc. These variables can have varying levels of impact on the decision to accept a paper due to reviewer bias, research domain, and prior standards for acceptance. Not all acceptances of research occur because the research is replicable. Mapping these variables to actual replication outcomes can produce a less biased estimation of replicability.

In this work, we develop a novel method for understanding replicability given only a PDF of the research, while encapsulating a wider, more robust set of factors than prior art. Using a combination of rule-based processing and machine learning, we perform consistent semantic parsing, feature extraction, and replicability classification. Our main contributions are as follows:

• Consistent text extraction
• Automated classification of semantic flow
• Multifaceted feature extraction
• Preliminary replication prediction results

2 Related Work

The work related to our contributions is multifold: previous literature has attempted to achieve the similar goal of predicting replicability, and there are a variety of methods relevant to our pipeline that have not previously been used for replicability prediction. We cover both aspects of that prior work here.

2.1 Predicting Replicability

The replication crisis is repeatedly noted throughout the replicability literature (Open Science Collaboration et al. 2015; Klein et al. 2014; Camerer et al. 2016). Because peer review is currently an entirely manual process, a natural consequence is the desire to automate the understanding of replicability. An early attempt uses prediction markets to determine a "market price" for research studies, representing the likelihood that those studies would replicate (Dreber et al. 2015). The prediction markets correctly predict 29/41 (71%) replications, but this method still requires approximately 50 domain experts to participate in the market. This is an impractical requirement for situations such as peer-reviewed conferences that often have three reviewers per paper. In (Altmejd et al. 2019), the authors attempt to predict the replicability of research by gathering features from within or about the research itself: statistical design properties such as sample size, effect size, and p-value, or descriptive aspects such as the number of citations, the number of authors, and how subjects are compensated. By aggregating a dataset of 131 direct replications, they achieve approximately 70% prediction accuracy with random forest models. Although the feature extraction is still a manual process, locating relevant features in a paper is a tractable problem for an individual, which is a substantial improvement over the prediction markets. (Yang, Youyou, and Uzzi 2020) take the automation a step further and obtain a 69% prediction accuracy by training on word embeddings of the research manuscript's text. We automatically extract the features of prior work, generate a new set of features, and estimate replicability with higher accuracy.
2.2 Natural Language Processing

Whether attempting to obtain statistical test information or to operate directly on text, natural language processing is critical to the automation of manuscript featurization. A crucial innovation in the realm of general-purpose natural language modeling is the use of models such as BERT (Devlin et al. 2018), which are pre-trained on large, unsupervised corpora. Through a fine-tuning step, these models transfer to new problems and domains. A particularly relevant application is SciBERT (Beltagy, Lo, and Cohan 2019), which is pre-trained on scientific publications from multiple domains. The authors show that this pre-training significantly improves results on downstream tasks related to scientific language. Recent work that focuses on scientific articles leverages these models to identify entities (Hakala and Pyysalo 2019), extract events and relationships (Allen et al. 2015; Valenzuela-Escárcega et al. 2018), and relate extracted events to domain models (Friedman et al. 2017). Our work utilizes fine-tuning to create span-based information extraction with a broader context that includes sample sizes, experimental methodologies, excluded sample counts, statistical tests, and more. Additionally, rather than focusing on the findings and contributions of scientific articles, we characterize methodologies, materials, confidence, and replicability.

3 Approach

We use a multi-stage pipeline in order to modularize each component of the extraction and prediction process. Each component can be easily swapped out as enhancements are developed, such as improvements in PDF parsing, new rules in rule-based methods, and updates to machine learning models. Figure 1 shows the flow of raw PDFs through the various components, leading to the output of JSON files that are formatted with the article's text and associated features for use in downstream models. The pipeline comprises several main components: PDF extraction, semantic tagging, and information extraction.

Figure 1: The full pipeline. We combine PDF extraction and rule-based parsing to generate strings of the research text, apply machine learning-based semantic tagging, and then extract features with machine learning and rule-based approaches to generate a single formatted JSON per paper. This formatted JSON is convenient for downstream models, such as argument structure construction and replication prediction.

3.1 PDF Extraction

Extracting text from a PDF and formatting it into informative segments are necessary steps for employing natural language approaches. We use Automator (Waldie 2009) to run the built-in PDF-to-RTF extraction tool. The RTF files maintain formatting information, but we use the command line utility textutil to convert the RTF files to HTML files, which we find to be more amenable to rule-based processing. We apply rules to the extraction because it is an error-prone process that fails around artifacts such as tables, captions, and footnotes. Each HTML representation is parsed into a hash map where the keys are content styles and the values are all concatenated words and white-spaces of that style in the order they appear. The main content string of the paper is identified as the longest value, by character count, in this hash map. The main content string is used for all subsequent processing.
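To make the style-keyed hash map concrete, the following is a minimal sketch of the grouping step, assuming textutil-style HTML in which each text run is a span carrying a style class; BeautifulSoup and the function name are illustrative choices, not necessarily the system's actual implementation.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

def main_content_string(html: str) -> str:
    """Group text runs by content style and return the longest group.

    Keys of the hash map are content styles (here, a span's class or
    inline style attribute); values are the concatenated words and
    white-space of that style in the order they appear.
    """
    soup = BeautifulSoup(html, "html.parser")
    by_style = defaultdict(str)
    for span in soup.find_all("span"):
        style = " ".join(span.get("class") or []) or span.get("style", "")
        by_style[style] += span.get_text()
    # The paper's main content is assumed to be the longest value,
    # by character count, in the hash map.
    return max(by_style.values(), key=len, default="")
```

In practice, the grouping is wrapped in the rules described above to skip artifacts such as tables, captions, and footnotes where the extraction is unreliable.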
3.2 Semantic Tagging

A key element of understanding the structure of an argument is the semantic context in which the argument is made. To that end, we develop a machine learning model to annotate paragraphs based on their content. This is similar to the annotation work presented in (Chan et al. 2018), (Huber and Carenini 2019), and (Dasigi et al. 2017). Here, though, we modify the annotation scheme to better match the problem of information extraction for replication prediction. We infer the discourse class for each sentence and average the outputs to obtain the final class. This yields the following modified annotation scheme with six elements:

• Introduction: Problem statement and paper structure.
• Methodology: Specifics of the study, including participants, materials, and models.
• Results: Experimental results and statistical tests.
• Discussion: Author's interpretation of results and implications of the findings.
• Research Practices: Conflicts of interest, funding sources, and acknowledgements.
• Reference: Citations.

Annotating Training Data In order to create a training set for discourse class prediction, we extract text from 838 social and behavioral science (SBS) research articles. In addition to the full text, these extractions contain the section headers, which we use as our annotations, resulting in 81,001 labeled sentences. Because section header names vary by domain, tradition, and personal preference, we assign a set of keywords to each discourse class and label a section or segment of text if its section header is grammatically close to a keyword (see the sketch below). The keywords used to create the dataset are:

• Introduction: {Introduction}
• Methodology: {Methodology, Analysis, Experiment, Method, Procedure, Design, Material, Participant}
• Results: {Results}
• Discussion: {Discussion, Conclusion}
• Research Practices: {Acknowledgements, Funding, Ethics Statement, Competing Interests, Ethical Approval}
• Reference: {Reference, Bibliography}

Table 1: The number of sentences per discourse tag extracted from the training data.

    Discourse Tag        Sentence Count
    Introduction         13,023
    Methodology          24,930
    Results              18,308
    Discussion           14,233
    Research Practices   353
    Reference            10,153
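The paper does not specify the "grammatically close" matching procedure further; the sketch below approximates it with difflib string similarity over the keywords above, so the tokenization and the 0.8 cutoff are assumptions (and multi-word keywords such as "Ethics Statement" would need phrase-level matching in practice).

```python
import difflib

DISCOURSE_KEYWORDS = {
    "Introduction": ["Introduction"],
    "Methodology": ["Methodology", "Analysis", "Experiment", "Method",
                    "Procedure", "Design", "Material", "Participant"],
    "Results": ["Results"],
    "Discussion": ["Discussion", "Conclusion"],
    "Research Practices": ["Acknowledgements", "Funding", "Ethics Statement",
                           "Competing Interests", "Ethical Approval"],
    "Reference": ["Reference", "Bibliography"],
}

def discourse_class(section_header: str, cutoff: float = 0.8):
    """Label a section if its header is close to a discourse keyword."""
    best_label, best_score = None, 0.0
    for label, keywords in DISCOURSE_KEYWORDS.items():
        for keyword in keywords:
            for word in section_header.split():
                score = difflib.SequenceMatcher(
                    None, word.lower(), keyword.lower()).ratio()
                if score > best_score:
                    best_label, best_score = label, score
    return best_label if best_score >= cutoff else None

print(discourse_class("2. Methods and Materials"))  # -> "Methodology"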
Creating Semantic Vector Representations Given a sentence extracted from a research article, we use the Universal Sentence Encoder model (Cer et al. 2018), which is designed to embed words, sentences, and small paragraphs into a semantically-related latent space. We represent each sentence as a 512-dimensional vector that encodes its general semantics and context.

We use that 512-dimensional vector as input to a fully-connected hidden layer of size 512, followed by another fully-connected hidden layer of size 256, followed by an output layer of size 6 (representing the discourse classes). A softmax activation after the output layer provides the discourse prediction. We use 50% dropout between the layers and a balanced sampling scheme to avoid overfitting to a single class. We use precision and recall to evaluate the prediction performance, shown in Table 2; a sketch of this classifier follows the example below.

Table 2: Precision/Recall/F1 results on a holdout set of annotated sentences.

                         Precision   Recall   F1-score
    Introduction         0.53        0.70     0.60
    Methodology          0.80        0.47     0.59
    Results              0.56        0.64     0.60
    Discussion           0.60        0.52     0.56
    Research Practices   0.96        0.73     0.83
    Reference            0.75        0.97     0.84

The following is an example input whose actual section header is "2.5 Inference," from (Cohan et al. 2020); our model predicts the discourse tag Methodology:

    "At inference time, the model receives one paper, P, and it outputs the SPECTER's Transformer pooled output activation as the paper representation for P (Equation 1). We note that for inference, SPECTER requires only the title and abstract of the given input paper; the model does not need any citation information about the input paper. This means that SPECTER can produce embeddings even for new papers that have yet to be cited, which is critical for applications that target recent scientific papers."
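As a sketch of the discourse classifier described above: the encoder is the public Universal Sentence Encoder from TensorFlow Hub, and the head mirrors the 512-256-6 architecture with 50% dropout. The hidden activations and optimizer are not stated in the text, so ReLU and Adam are assumptions.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Universal Sentence Encoder: embeds sentences into a 512-d latent space.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Discourse classifier head: 512 -> 512 -> 256 -> 6, with 50% dropout
# between layers and a softmax over the six discourse classes.
classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),
    tf.keras.layers.Dense(512, activation="relu"),   # assumed activation
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation="relu"),   # assumed activation
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(6, activation="softmax"),
])
classifier.compile(optimizer="adam",                  # assumed optimizer
                   loss="sparse_categorical_crossentropy")

sentences = ["Participants completed a 40-item questionnaire."]
probs = classifier(encoder(sentences))   # shape (1, 6) class probabilities
```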
3.3 Information Extraction

In addition to the content and context extracted from the article, we further include features that are unrelated to the structure of the paper but essential to the analysis of a paper's claims. These include both natural language features and statistical test results.

Language Quality Regardless of the validity of a paper's methodology and analysis, a failure to adequately communicate that information hinders others from using or replicating the research. As a means of assessing the quality of the writing itself, we compute three metrics over each paragraph in the text: readability, subjectivity, and sentiment. The idea that readability relates to the ability to reproduce findings comes from the discussion in (Plavén-Sigray et al. 2017). We consider subjectivity due to discussions with social and behavioral science domain experts about inferring possible questionable research practices. Finally, positive or negative sentiment in the results or discussion sections may indicate biases towards the outcomes of the research. Although any one of these features may not directly express replicability, together they provide a holistic view of the writing.

Using each paragraph as input, we compute readability using the Flesch Reading Ease formula (Kincaid et al. 1975), sentiment using the AllenNLP suite (Gardner et al. 2017), and subjectivity using the TextBlob package (Loria 2018). This produces a distribution of these features over the text that we can relate to discourse, experimental results (statistics), and domain-specific extractions.
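A minimal sketch of these per-paragraph metrics follows. TextBlob's subjectivity and polarity scores are the library's real API; we substitute the textstat package for the Flesch Reading Ease computation and TextBlob polarity for the AllenNLP sentiment model to keep the example self-contained, whereas the actual pipeline uses AllenNLP for sentiment.

```python
import textstat
from textblob import TextBlob

def language_quality(paragraph: str) -> dict:
    """Readability, subjectivity, and sentiment for one paragraph."""
    blob = TextBlob(paragraph)
    return {
        # Flesch Reading Ease: higher scores indicate easier text.
        "readability": textstat.flesch_reading_ease(paragraph),
        # 0.0 is fully objective, 1.0 is fully subjective.
        "subjectivity": blob.sentiment.subjectivity,
        # Stand-in for the AllenNLP sentiment model: -1 to +1 polarity.
        "sentiment": blob.sentiment.polarity,
    }

print(language_quality("Our results clearly demonstrate a remarkable effect."))
```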
3.4 Methodological Information Extraction

The unstructured prose of scientific documents includes key features for assessing replicability, such as sample sizes, populations, conditions, experimental variables, methods, materials, exclusion criteria, and participant compensation. Much of this information is available as concise spans of text in the document: "twenty-four" may be a sample size; "undergraduates" may be a population description; "reaction time" may be a dependent variable; and so on. Consequently, we are not interested in extracting and classifying relations at this phase of analysis; rather, we optimize our information extractor to classify individual spans within the text with context-sensitive labels.

Our dataset includes 620 labeled examples that are annotated with the following properties:

• Sample count: How many elements are in the sample.
• Sample noun: Noun phrases referring to sample elements, e.g., students, participants, cases, etc.
• Sample detail: Details of the same, e.g., race, sex, age, community, university, AMT, etc.
• Compensation: How participants are compensated.
• Exclusion count: Number excluded from the sample.
• Exclusion reason: Stated reason(s) for why elements are excluded from the final sample set.
• Experiment reference: Name of or reference to an experiment within the document.
• Experimental condition: Named or unnamed control or experimental condition employed.
• Experimental variable/factor: Elements measured or reported in the document, e.g., reaction time, participant preference, accuracy on a task.
• Method or material: Experimental methods or materials employed, e.g., ANOVA, questionnaire, priming.

We extract these features using a transformer-based, token-level classifier that processes each sentence separately. The output of the classifier model is a Begin/Inside/Outside (BIO) prediction for each token in a sentence. This assumes that no labels overlap in a sentence, which is one constraint of our dataset.
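Below is a minimal sketch of such a token-level classifier using the Hugging Face transformers library with the public SciBERT checkpoint. The BIO label inventory is assembled from the span types above, reusing the abbreviated tag names from Figures 2-4 where available (the remaining abbreviations are our guesses), and the fine-tuning loop is omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO labels: B-/I- per span type plus one O ("outside") label -> 21 classes.
SPAN_TYPES = ["samp_num", "samp_noun", "samp_detail", "compensation",
              "excl_num", "excl_reason", "exper_ref", "exper_cond",
              "exper_factor", "method_or_material"]
LABELS = ["O"] + [f"{prefix}-{t}" for t in SPAN_TYPES for prefix in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=len(LABELS))

# One sentence at a time, per the sentence-level design described above.
inputs = tokenizer("Eight participants were excluded for inattention.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, n_tokens, 21)
predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
```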
We illustrate the above labels as predicted on some typical sentences from research articles in the SBS literature. Figure 2 shows our model's information extraction results for a typical statement introducing a population and sample size. It tags the English spans for the sample count "one hundred and ninety-seven," the sample noun "individuals," details (i.e., age mean, SD, gender, and AMT), an experiment reference to "this study," and the compensation of "$1." In another paper, Figure 3 identifies the number of sample elements excluded, along with the resulting sample number and gender details. Finally, Figure 4 shows a sentence from the summary of an article, tagging "sem" (Structural Equation Modeling) as a methodology, the sample noun and details, and five experimental factors that are assessed in the paper.

Figure 2: Labeling spans for sample size (samp_num), sample details (samp_detail), and subject compensation (compensation) in a specific study (exper_ref).

Figure 3: Labeling spans for the number of sample elements excluded (excl_num) and the stated reason they were excluded (excl_reason), as well as the final sample number.

Figure 4: Labeling the sample, the experimental methods employed (method_or_material), and the factors (exper_factor) under study.

Our model next processes the resulting classified spans (as shown in Figures 2, 3, and 4) to opportunistically extract domain-specific numerical and Boolean features. For example, the sample count and exclusion count are both expected to be integers, so it attempts to coerce "one hundred and ninety-seven" (Figure 2) and "Eight" (Figure 3) to integers and populate the corresponding integer features. Similarly, the model uses a lexicon-based approach over the sample descriptor spans to populate Boolean features indicating whether participants' gender, age, race, religion, and community are specified, what the recruitment pool is (e.g., AMT, universities, etc.), and how participants are compensated (e.g., course credit, monetary, etc.). These numerical, Boolean, and lexical features populate the argument structure of the paper, which we describe in subsequent sections.

We train a model by fine-tuning SciBERT and DistilBERT uncased models, and we evaluate using the same 558/62 randomized train/test split of our 620 labeled examples. Table 3 shows the results across the transformer models for 100 iterations each, with the best performance coming from the SciBERT uncased model. While our model shows favorable results for our relatively small dataset of 620 examples, we are presently extending the dataset.

Table 3: Precision/Recall/F1 results on a holdout set of information extraction examples.

    Transformer Model          Precision   Recall   F1
    distilbert uncased         0.62        0.70     0.66
    roberta base               0.59        0.64     0.61
    bert large uncased         0.61        0.71     0.66
    scibert scivocab uncased   0.67        0.74     0.70
    scibert scivocab cased     0.62        0.73     0.67

One limitation of the present sentence-level analysis is that cross-sentence coreferring expressions are unresolvable within the model, although, since we are not extracting complex relations across entities, most context-sensitive concepts such as sample size and exclusion count have ample context within the sentence itself. We plan to quantify the benefit of adding cross-sentence coreference resolution in future work.

3.5 Statistical Test Extraction

The descriptions of statistical tests in scientific documents are much more structured than the descriptions of samples, methods, and factors. Consequently, our system uses Python regular expressions (rather than a transformer-based model) to extract statistical tests, motivated by processing speed. Figure 5 shows a semantic subgraph for a cluster of extracted tests: each P-test includes an ordinal comparator and a value for p, and each F-test includes two degrees of freedom and a value. Clustering these statistical tests into subgraphs helps identify duplicate reports of experimental results, and it provides context for downstream graphical analysis and machine learning.

Figure 5: Semantic subgraph for a local cluster of two statistical tests extracted from a paper. A string such as "F(1, 26) = 12.84, p < .001" becomes a StatTest node with an FTest subtest (df_val1 = 1.0, df_val2 = 26.0, val = 12.84) and a PTest subtest (ordinal "<", val = 0.001).
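The exact regular expressions are not given in the paper; a simplified sketch that recovers the fields shown in Figure 5 (degrees of freedom, test value, ordinal, and p-value) might look like the following.

```python
import re

# F-tests such as "F(1, 26) = 12.84" and p-values such as "p < .001".
F_TEST = re.compile(r"F\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d*\.?\d+)")
P_TEST = re.compile(r"\bp\s*([<>=])\s*(\d*\.\d+)")

def extract_stat_tests(text: str) -> list:
    """Return (subtest name, fields) pairs for each test found."""
    tests = []
    for m in F_TEST.finditer(text):
        tests.append(("FTest", {"df_val1": float(m.group(1)),
                                "df_val2": float(m.group(2)),
                                "val": float(m.group(3))}))
    for m in P_TEST.finditer(text):
        tests.append(("PTest", {"ordinal": m.group(1),
                                "val": float(m.group(2))}))
    return tests

print(extract_stat_tests("F(1, 26) = 12.84, p < .001"))
# [('FTest', {'df_val1': 1.0, 'df_val2': 26.0, 'val': 12.84}),
#  ('PTest', {'ordinal': '<', 'val': 0.001})]
```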
3.6 Assembly into Argument Structure

After extracting individual spans and subgraphs from the unstructured prose of a scientific article, we assemble the extracted information into a global graph that we refer to as the argument structure of the document. As implied by its name, the argument structure is designed to express the premises, evidence, and observations in a scientific article, ultimately in support of its conclusions.

The system generates the argument structure by iterating over the sequence of text segments and associated semantic tags (see Table 2 for a list of tags). Upon encountering a transition in semantic tags, such as a new Methodology section after a Discussion section, the system instantiates a new Study node within its argument structure, and then adds the BERT-extracted features (see above) and statistical test subgraphs (see above) as constituents of the new node. In this fashion, the system accumulates nodes for Introduction, Study, and Discussion prose. A small set of features for two Study nodes from the same paper are shown in Figure 6, each linked to the sentences and paragraphs of the scientific article itself. In this fashion, the argument structure is a fully-connected graph that supports graph pattern matching, confidence propagation, and feature extraction in order to judge and explain replicability.

Figure 6: Populating argument structure for two studies using information extracted across sentences and paragraphs.
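A minimal sketch of the node-creation loop described above, assuming segments arrive as (semantic tag, extracted features, statistical tests) triples; the real argument structure is a richer, fully-connected graph, so the flat list of Study nodes here is only illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class StudyNode:
    """One Study node: extracted features plus statistical-test subgraphs."""
    features: dict = field(default_factory=dict)
    stat_tests: list = field(default_factory=list)

def assemble_studies(segments):
    """Instantiate a new Study node at each transition into Methodology,
    then attach each segment's extractions to the current node."""
    studies, current, prev_tag = [], None, None
    for tag, features, stat_tests in segments:
        if tag == "Methodology" and prev_tag != "Methodology":
            current = StudyNode()
            studies.append(current)
        if current is not None:
            current.features.update(features)
            current.stat_tests.extend(stat_tests)
        prev_tag = tag
    return studies

segments = [
    ("Methodology", {"samp_num": 197}, []),
    ("Results", {}, [("FTest", {"df_val1": 1.0, "df_val2": 26.0, "val": 12.84})]),
    ("Methodology", {"samp_num": 54}, []),   # starts a second Study node
]
print(len(assemble_studies(segments)))        # 2
```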
We select claims within and across scientific documents in our cor- 11 psychology papers from the dataset and use these as the pus. To improve human interpretation, we are working to evaluation set. We predict using experiment p-value and the produce an explainability interface for users to inspect our presence of effect size (binary). The results for the individ- extractions, predictions, and argument structure for guided ual papers are shown in Table 5. paper understanding. Table 4: The accuracy and AUC for a random forest classifier with 5000 estimators and a max depth of 3. Accuracy AUC Evaluation Set 0.90 0.89 Table 5: The individual predictions and labels for each paper in the evaluation set. The model correctly predicts 10 of the 11 papers. Paper Reference Label Prediction (Nosek, Banaji, and Greenwald 2002) Exp. 1 1 0.70 (Nosek, Banaji, and Greenwald 2002) Exp. 2 1 0.66 (Soto et al. 2008) 1 0.66 (Monin, Sawyer, and Marquez 2008) 0 0.36 (Purdie-Vaughns et al. 2008) 0 0.31 (Goff, Steele, and Davies 2008) 0 0.27 (Payne, Burkley, and Stokes 2008) 1 0.27 (Shnabel and Nadler 2008) 0 0.25 (Lemay Jr and Clark 2008a) 0 0.24 (Fischer, Greitemeyer, and Frey 2008) 0 0.23 (Lemay Jr and Clark 2008b) 0 0.06 We will further assess the validity of the elements of a Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pre- paper, i.e., the confidence that we have in the claims in the trained language model for scientific text. arXiv preprint paper given the assumptions made by it, by using an exist- arXiv:1903.10676 . ing probabilistic inference and recognition system, SUNNY, Camerer, C. F.; Dreber, A.; Forsell, E.; Ho, T.-H.; Huber, J.; originally developed for planning (Kuter et al. 2004) and Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; social network analysis (Kuter and Golbeck 2007, 2010). Chan, T.; et al. 2016. Evaluating replicability of laboratory SUNNY propagates local estimates of uncertainty through experiments in economics. Science 351(6280): 1433–1436. large models. Its most basic output is the probability, as a function of time, that a particular event will be true. We have Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, extended SUNNY for k-nearest neighbors (kNN) learning R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, and prediction capabilities, as well as a Naive Bayes diag- C.; et al. 2018. Universal sentence encoder. arXiv preprint noses of confidence scores based on the (kNN) clustering. arXiv:1803.11175 . The numeric and qualitative features in our argument struc- Chan, J.; Chang, J. C.; Hope, T.; Shahaf, D.; and Kittur, A. tures form the basis of the kNN clustering and we will ex- 2018. Solvent: A mixed initiative system for finding analo- tend these measures towards predicting replicability scores gies between research papers. Proceedings of the ACM on in SUNNY in the near future. Human-Computer Interaction 2(CSCW): 1–21. Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, Acknowledgements D. S. 2020. Specter: Document-level representation learning This material is based upon work supported by the Defense using citation-informed transformers. In Proceedings of the Advanced Research Projects Agency (DARPA) and Army 58th Annual Meeting of the Association for Computational Research Office (ARO) under Contract No. W911NF-20- Linguistics, 2270–2282. C-0002. Any opinions, findings, and conclusions or recom- mendations expressed in this material are those of the au- Dasigi, P.; Burns, G. A.; Hovy, E.; and de Waard, A. 2017. 
4 Discussion

Targeting replicability in the evaluation of research is a diverse task that is often not prioritized during peer review. Improving human comprehension of decisive factors is a crucial step towards integrating automated systems for replicability prediction into the review process. In this work, we develop an automated system for identifying, extracting, and organizing those factors. We introduce measures of language quality such as subjectivity, sentiment, and readability; we semantically tag text in order to understand language context; we extract statistical test information, linguistic relationships, and methodologies; and we then construct a hierarchical argument structure and perform replicability classification. These factors and their organization are intuitive to readers and allow for both top-down and bottom-up understanding of a paper's methods. Although leaving the review process entirely up to automation is not feasible, human-in-the-loop systems that guide reviewers through important text, factors, and predictions can reduce the number of non-replicable papers that make it through review.

5 Future Work

One of the main focuses of our future work is to extend our ground truth datasets and evaluate replicability prediction across the combinations of features that we develop in the current work. Due to the limited data size, the current evaluation set is too small to definitively select the best combination of features for replicability prediction.

We are also working to broaden our system's features and capabilities. For instance, we are incorporating a transformer-based information extractor that extracts the causal, proportional, and comparative relationships in scientific claims (Magnusson and Friedman 2021) to relate the claims within and across scientific documents in our corpus. To improve human interpretation, we are working to produce an explainability interface for users to inspect our extractions, predictions, and argument structure for guided paper understanding.

We will further assess the validity of the elements of a paper, i.e., the confidence that we have in the claims of the paper given the assumptions it makes, by using an existing probabilistic inference and recognition system, SUNNY, originally developed for planning (Kuter et al. 2004) and social network analysis (Kuter and Golbeck 2007, 2010). SUNNY propagates local estimates of uncertainty through large models. Its most basic output is the probability, as a function of time, that a particular event will be true. We have extended SUNNY with k-nearest neighbors (kNN) learning and prediction capabilities, as well as a Naive Bayes diagnosis of confidence scores based on the kNN clustering. The numeric and qualitative features in our argument structures form the basis of the kNN clustering, and we will extend these measures towards predicting replicability scores in SUNNY in the near future.

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO) under Contract No. W911NF-20-C-0002. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO).

References

Allen, J.; de Beaumont, W.; Galescu, L.; and Teng, C. M. 2015. Complex event extraction using DRUM. Technical report, Florida Institute for Human and Machine Cognition, Pensacola, United States.

Altmejd, A.; Dreber, A.; Forsell, E.; Huber, J.; Imai, T.; Johannesson, M.; Kirchler, M.; Nave, G.; and Camerer, C. 2019. Predicting the replicability of social science lab experiments. PLoS ONE 14(12).

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Camerer, C. F.; Dreber, A.; Forsell, E.; Ho, T.-H.; Huber, J.; Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; Chan, T.; et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351(6280): 1433–1436.

Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Chan, J.; Chang, J. C.; Hope, T.; Shahaf, D.; and Kittur, A. 2018. Solvent: A mixed initiative system for finding analogies between research papers. Proceedings of the ACM on Human-Computer Interaction 2(CSCW): 1–21.

Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, D. S. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2270–2282.

Dasigi, P.; Burns, G. A.; Hovy, E.; and de Waard, A. 2017. Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks. arXiv preprint arXiv:1702.05398.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dreber, A.; Pfeiffer, T.; Almenberg, J.; Isaksson, S.; Wilson, B.; Chen, Y.; Nosek, B. A.; and Johannesson, M. 2015. Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences 112(50): 15343–15347.

Fischer, P.; Greitemeyer, T.; and Frey, D. 2008. Self-regulation and selective exposure: The impact of depleted self-regulation resources on confirmatory information processing. Journal of Personality and Social Psychology 94(3): 382.

Friedman, S.; Burstein, M.; McDonald, D.; Plotnick, A.; Bobrow, L.; Bobrow, R.; Cochran, B.; and Pustejovsky, J. 2017. Learning by reading: Extending and localizing against a model. Advances in Cognitive Systems 5: 77–96.

Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N. F.; Peters, M.; Schmitz, M.; and Zettlemoyer, L. S. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform.

Goff, P. A.; Steele, C. M.; and Davies, P. G. 2008. The space between us: Stereotype threat and distance in interracial contexts. Journal of Personality and Social Psychology 94(1): 91.

Hakala, K.; and Pyysalo, S. 2019. Biomedical named entity recognition with multilingual BERT. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 56–61.

Head, M. L.; Holman, L.; Lanfear, R.; Kahn, A. T.; and Jennions, M. D. 2015. The extent and consequences of p-hacking in science. PLoS Biology 13(3): e1002106.
Huber, P.; and Carenini, G. 2019. Predicting discourse structure using distant supervision from sentiment. arXiv preprint arXiv:1910.14176.

Kincaid, J. P.; Fishburne Jr, R. P.; Rogers, R. L.; and Chissom, B. S. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command, Millington, TN, Research Branch.

Klein, R. A.; Ratliff, K. A.; Vianello, M.; Adams Jr, R. B.; Bahník, Š.; Bernstein, M. J.; Bocian, K.; Brandt, M. J.; Brooks, B.; Brumbaugh, C. C.; et al. 2014. Investigating variation in replicability. Social Psychology.

Kuter, U.; and Golbeck, J. 2007. SUNNY: A new algorithm for trust inference in social networks using probabilistic confidence models. In AAAI.

Kuter, U.; and Golbeck, J. 2010. Using Probabilistic Confidence Models for Trust Inference in Web-Based Social Networks. Transactions on Internet Technology (TOIT) 7: 1377–1382.

Kuter, U.; Nau, D.; Gossink, D.; and Lemmer, J. F. 2004. Interactive Course-of-Action Planning Using Causal Models. In International Conference on Knowledge Systems for Coalition Operations (KSCO-2004), 37–52.

Lemay Jr, E. P.; and Clark, M. S. 2008a. "Walking on eggshells": How expressing relationship insecurities perpetuates them. Journal of Personality and Social Psychology 95(2): 420.

Lemay Jr, E. P.; and Clark, M. S. 2008b. How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. Journal of Personality and Social Psychology 94(4): 647.

Loria, S. 2018. textblob Documentation. Release 0.15.2.

Magnusson, I. H.; and Friedman, S. E. 2021. Graph Knowledge Extraction of Causal, Comparative, Predictive, and Proportional Associations in Scientific Claims with a Transformer-Based Model. In AAAI Workshop on Scientific Document Understanding.

Monin, B.; Sawyer, P. J.; and Marquez, M. J. 2008. The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology 95(1): 76.

Nosek, B. A.; Banaji, M. R.; and Greenwald, A. G. 2002. Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology 83(1): 44.

Open Science Collaboration; et al. 2015. Estimating the reproducibility of psychological science. Science 349(6251).

Payne, B. K.; Burkley, M. A.; and Stokes, M. B. 2008. Why do implicit and explicit attitude tests diverge? The role of structural fit. Journal of Personality and Social Psychology 94(1): 16.

Plavén-Sigray, P.; Matheson, G. J.; Schiffler, B. C.; and Thompson, W. H. 2017. The readability of scientific texts is decreasing over time. eLife 6: e27725.

Purdie-Vaughns, V.; Steele, C. M.; Davies, P. G.; Ditlmann, R.; and Crosby, J. R. 2008. Social identity contingencies: How diversity cues signal threat or safety for African Americans in mainstream institutions. Journal of Personality and Social Psychology 94(4): 615.

Shnabel, N.; and Nadler, A. 2008. A needs-based model of reconciliation: Satisfying the differential emotional needs of victim and perpetrator as a key to promoting reconciliation. Journal of Personality and Social Psychology 94(1): 116.

Soto, C. J.; John, O. P.; Gosling, S. D.; and Potter, J. 2008. The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. Journal of Personality and Social Psychology 94(4): 718.

Valenzuela-Escárcega, M. A.; Babur, Ö.; Hahn-Powell, G.; Bell, D.; Hicks, T.; Noriega-Atala, E.; Wang, X.; Surdeanu, M.; Demir, E.; and Morrison, C. T. 2018. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database 2018.

Waldie, B. 2009. Automator for Mac OS X 10.6 Snow Leopard: Visual QuickStart Guide.

Yang, Y.; Youyou, W.; and Uzzi, B. 2020. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proceedings of the National Academy of Sciences 117(20): 10762–10768.