Toward A Robust Method for Understanding the Replicability of Research

Ben Gelman,1 Chae Clark,1 Scott Friedman,2 Ugur Kuter,2 James Gentile1
1 Two Six Labs, Arlington, VA, USA
2 SIFT, Minneapolis, MN, USA
{ben.gelman, chae.clark, james.gentile}@twosixlabs.com, {friedman, ukuter}@sift.net

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The replicability of research is crucial for building trust in the peer review process and transitioning knowledge to real-world applications. While manual peer review excels in some regards, the variability of reviewer expertise, publication requirements, and research domains brings about uncertainty in the process. Replicability, in particular, is not necessarily a priority; this is evidenced by repeated failures in replication attempts such as the Psychology Reproducibility Project, where 61 of 100 replications fail. Improving human comprehension of decisive factors is crucial for integrating automated systems for replicability prediction into the review process. We develop a robust, automated method for semantic parsing, information extraction, and replication prediction that operates directly on PDFs. We introduce features that have not been explored in prior work, construct argument structures to guide understanding, and provide preliminary results for replication prediction.

1 Introduction

The replicability of research is crucial for building trust in the peer review process and for the transition of knowledge to real-world applications. Unfortunately, current attempts at replicating research show that many research papers do not replicate, with 61 of 100 failing in the Psychology Reproducibility Project (Open Science Collaboration et al. 2015), 7 of 18 failing among laboratory economics experiments (Camerer et al. 2016), 3 of 13 failing in the Many Labs Replication Project (Klein et al. 2014), and more.

Currently, research is manually peer reviewed by a few experts, often donating their time via venues such as conferences and journals. While manual peer review excels in some regards, the variability of reviewer expertise, publication requirements, and research domains brings about multiple levels of uncertainty. Additionally, peer review does not specifically attempt to identify the replicability of research, and, despite the increasing number of automated analysis tools and replication prediction systems, there have been few changes to the review process over the years.

Determining replicability at review time is challenging for a multitude of reasons: limited access to data, limited reviewer time, inability to run new experiments, misleading statistics (Head et al. 2015), and the myriad variables that affect a reviewer's perception of the research, such as the readability of the explanations, the clarity and detail of the methodology, the significance of the authors' claims, etc. These variables can have varying levels of impact on the decision to accept a paper due to reviewer bias, research domain, and prior standards for acceptance. Not all acceptances of research occur because the research is replicable. Mapping these variables to actual replication outcomes can produce a less biased estimation of replicability.

In this work, we develop a novel method for understanding replicability given only a PDF of the research, while encapsulating a wider, more robust set of factors than prior art. Using a combination of rule-based processing and machine learning, we perform consistent semantic parsing, feature extraction, and replicability classification. Our main contributions are as follows:

• Consistent text extraction
• Automated classification of semantic flow
• Multifaceted feature extraction
• Preliminary replication prediction results

2 Related Work

The work related to our contributions is multifold: previous literature has attempted to achieve the similar goal of predicting replicability, and there are a variety of methods relevant to our pipeline that have not previously been used for replicability prediction. We cover both aspects of that prior work here.

2.1 Predicting Replicability

The replication crisis is repeatedly noted throughout the replicability literature (Open Science Collaboration et al. 2015; Klein et al. 2014; Camerer et al. 2016). Because peer review is currently an entirely manual process, a natural consequence is the desire to automate the understanding of replicability. An early attempt uses prediction markets to determine a "market price" for research studies, representing the likelihood that those studies would replicate (Dreber et al. 2015). The prediction markets correctly predict 29/41 (71%) replications, but this method still requires approximately 50 domain experts to participate in the market. This is an impractical requirement for situations such as peer-reviewed conferences that often have three reviewers per paper. In (Altmejd et al. 2019), the authors attempt to predict the replicability of research by gathering features from within or about the research itself: statistical design properties such as sample size, effect size, and p-value, or descriptive aspects such as the number of citations, the number of authors, and how subjects are compensated. By aggregating a dataset of 131 direct replications, they achieve approximately 70% prediction accuracy with random forest models. Although the feature extraction is still a manual process, locating relevant features in a paper is a tractable problem for an individual, which is a substantial improvement over the prediction markets. (Yang, Youyou, and Uzzi 2020) take the automation a step further and obtain a 69% prediction accuracy by training on word embeddings of the research manuscript's text. We automatically extract the features of prior work, generate a new set of features, and estimate replicability with higher accuracy.
2.2 Natural Language Processing

Whether attempting to obtain statistical test information or to operate directly on text, natural language processing is critical to the automation of manuscript featurization. A crucial innovation in the realm of general-purpose natural language modeling is the use of models such as BERT (Devlin et al. 2018), which are pre-trained on large, unsupervised corpora. Through a fine-tuning step, these models transfer to new problems and domains. A particularly relevant application is SciBERT (Beltagy, Lo, and Cohan 2019), which is pre-trained on scientific publications from multiple domains. The authors show that this pre-training significantly improves results on downstream tasks related to scientific language. Recent work that focuses on scientific articles leverages these models to identify entities (Hakala and Pyysalo 2019), extract events and relationships (Allen et al. 2015; Valenzuela-Escárcega et al. 2018), and relate extracted events to domain models (Friedman et al. 2017). Our work utilizes fine-tuning to create span-based information extraction with a broader context that includes sample sizes, experimental methodologies, excluded sample counts, statistical tests, and more. Additionally, rather than focusing on the findings and contributions of scientific articles, we characterize methodologies, materials, confidence, and replicability.

3 Approach

We use a multi-stage pipeline in order to modularize each component of the extraction and prediction process. Each component can be easily swapped out as enhancements are developed, such as improvements in PDF parsing, new rules in rule-based methods, and updates to machine learning models. Figure 1 shows the flow of raw PDFs through the various components, leading to the output of JSON files that are formatted with the article's text and associated features for use in downstream models. The pipeline comprises several main components: PDF extraction, semantic tagging, and information extraction.

Figure 1: The full pipeline. We combine PDF extraction and rule-based parsing to generate strings of the research text, apply machine learning-based semantic tagging, and then extract features with machine learning and rule-based approaches to generate a single formatted JSON per paper. This formatted JSON is convenient for downstream models, such as argument structure construction and replication prediction.

3.1 PDF Extraction

Extracting text from a PDF and formatting it into informative segments are necessary steps for employing natural language approaches. We use Automator (Waldie 2009) to run the built-in PDF-to-RTF extraction tool. The RTF files maintain formatting information, but we use the command line utility textutil to convert the RTF files to HTML files, which we find to be more amenable to rule-based processing. We apply rules to the extraction because it is an error-prone process that fails around artifacts such as tables, captions, and footnotes. Each HTML representation is parsed into a hash map where the keys are content styles and the values are all concatenated words and white-spaces of that style in the order they appear. The main content string of the paper is identified as the longest value, by character count, in this hash map. The main content string is used for all subsequent processing.
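To make the style-keyed hash map concrete, the following is a minimal sketch of the grouping step, assuming textutil-style HTML in which each text run is a span carrying a style class; BeautifulSoup and the function name are illustrative choices, not necessarily the system's actual implementation.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

def main_content_string(html: str) -> str:
    """Group text runs by content style and return the longest group.

    Keys of the hash map are content styles (here, a span's class or
    inline style attribute); values are the concatenated words and
    white-space of that style in the order they appear.
    """
    soup = BeautifulSoup(html, "html.parser")
    by_style = defaultdict(str)
    for span in soup.find_all("span"):
        style = " ".join(span.get("class") or []) or span.get("style", "")
        by_style[style] += span.get_text()
    # The paper's main content is assumed to be the longest value,
    # by character count, in the hash map.
    return max(by_style.values(), key=len, default="")
```

In practice, the grouping is wrapped in the rules described above to skip artifacts such as tables, captions, and footnotes where the extraction is unreliable.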
3.2 Semantic Tagging

A key element of understanding the structure of an argument is the semantic context in which the argument is made. To that end, we develop a machine learning model to annotate paragraphs based on their content. This is similar to the annotation work presented in (Chan et al. 2018), (Huber and Carenini 2019), and (Dasigi et al. 2017). Here, though, we modify the annotation scheme to better match the problem of information extraction for replication prediction. We infer the discourse class for each sentence and average the outputs to obtain the final class. This yields the following modified annotation scheme with six elements:

• Introduction: Problem statement and paper structure.
• Methodology: Specifics of the study, including participants, materials, and models.
• Results: Experimental results and statistical tests.
• Discussion: Author's interpretation of results and implications of the findings.
• Research Practices: Conflicts of interest, funding sources, and acknowledgements.
• Reference: Citations.

Annotating Training Data In order to create a training set for discourse class prediction, we extract text from 838 social and behavioral science (SBS) research articles. In addition to the full text, these extractions contain the section headers, which we use as our annotations, resulting in 81,001 labeled sentences. Because section header names vary by domain, tradition, and personal preference, we assign a set of keywords to each discourse class and label a section or segment of text if its section header is grammatically close to a keyword (see the sketch below). The keywords used to create the dataset are:

• Introduction: {Introduction}
• Methodology: {Methodology, Analysis, Experiment, Method, Procedure, Design, Material, Participant}
• Results: {Results}
• Discussion: {Discussion, Conclusion}
• Research Practices: {Acknowledgements, Funding, Ethics Statement, Competing Interests, Ethical Approval}
• Reference: {Reference, Bibliography}

Table 1: The number of sentences per discourse tag extracted from the training data.

    Discourse Tag        Sentence Count
    Introduction         13,023
    Methodology          24,930
    Results              18,308
    Discussion           14,233
    Research Practices   353
    Reference            10,153
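The paper does not specify the "grammatically close" matching procedure further; the sketch below approximates it with difflib string similarity over the keywords above, so the tokenization and the 0.8 cutoff are assumptions (and multi-word keywords such as "Ethics Statement" would need phrase-level matching in practice).

```python
import difflib

DISCOURSE_KEYWORDS = {
    "Introduction": ["Introduction"],
    "Methodology": ["Methodology", "Analysis", "Experiment", "Method",
                    "Procedure", "Design", "Material", "Participant"],
    "Results": ["Results"],
    "Discussion": ["Discussion", "Conclusion"],
    "Research Practices": ["Acknowledgements", "Funding", "Ethics Statement",
                           "Competing Interests", "Ethical Approval"],
    "Reference": ["Reference", "Bibliography"],
}

def discourse_class(section_header: str, cutoff: float = 0.8):
    """Label a section if its header is close to a discourse keyword."""
    best_label, best_score = None, 0.0
    for label, keywords in DISCOURSE_KEYWORDS.items():
        for keyword in keywords:
            for word in section_header.split():
                score = difflib.SequenceMatcher(
                    None, word.lower(), keyword.lower()).ratio()
                if score > best_score:
                    best_label, best_score = label, score
    return best_label if best_score >= cutoff else None

print(discourse_class("2. Methods and Materials"))  # -> "Methodology"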
Creating Semantic Vector Representations Given a sentence extracted from a research article, we use the Universal Sentence Encoder model (Cer et al. 2018), which is designed to embed words, sentences, and small paragraphs into a semantically-related latent space. We represent each sentence as a 512-dimensional vector that encodes its general semantics and context.

We use that 512-dimensional vector as input to a fully-connected hidden layer of size 512, followed by another fully-connected hidden layer of size 256, followed by an output layer of size 6 (representing the discourse classes). A softmax activation after the output layer provides the discourse prediction. We use 50% dropout between the layers and a balanced sampling scheme to avoid overfitting to a single class. We use precision and recall to evaluate the prediction performance, shown in Table 2; a sketch of this classifier follows the example below.

Table 2: Precision/Recall/F1 results on a holdout set of annotated sentences.

                         Precision   Recall   F1-score
    Introduction         0.53        0.70     0.60
    Methodology          0.80        0.47     0.59
    Results              0.56        0.64     0.60
    Discussion           0.60        0.52     0.56
    Research Practices   0.96        0.73     0.83
    Reference            0.75        0.97     0.84

The following is an example input whose actual section header is "2.5 Inference," from (Cohan et al. 2020); our model predicts the discourse tag Methodology:

    "At inference time, the model receives one paper, P, and it outputs the SPECTER's Transformer pooled output activation as the paper representation for P (Equation 1). We note that for inference, SPECTER requires only the title and abstract of the given input paper; the model does not need any citation information about the input paper. This means that SPECTER can produce embeddings even for new papers that have yet to be cited, which is critical for applications that target recent scientific papers."
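As a sketch of the discourse classifier described above: the encoder is the public Universal Sentence Encoder from TensorFlow Hub, and the head mirrors the 512-256-6 architecture with 50% dropout. The hidden activations and optimizer are not stated in the text, so ReLU and Adam are assumptions.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Universal Sentence Encoder: embeds sentences into a 512-d latent space.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Discourse classifier head: 512 -> 512 -> 256 -> 6, with 50% dropout
# between layers and a softmax over the six discourse classes.
classifier = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),
    tf.keras.layers.Dense(512, activation="relu"),   # assumed activation
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation="relu"),   # assumed activation
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(6, activation="softmax"),
])
classifier.compile(optimizer="adam",                  # assumed optimizer
                   loss="sparse_categorical_crossentropy")

sentences = ["Participants completed a 40-item questionnaire."]
probs = classifier(encoder(sentences))   # shape (1, 6) class probabilities
```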
3.3 Information Extraction

In addition to the content and context extracted from the article, we further include features that are unrelated to the structure of the paper but essential to the analysis of a paper's claims. These include both natural language features and statistical test results.

Language Quality Regardless of the validity of a paper's methodology and analysis, a failure to adequately communicate that information hinders others from using or replicating the research. As a means of assessing the quality of the writing itself, we compute three metrics over each paragraph in the text: readability, subjectivity, and sentiment. The idea that readability relates to the ability to reproduce findings comes from the discussion in (Plavén-Sigray et al. 2017). We consider subjectivity due to discussions with social and behavioral science domain experts about inferring possible questionable research practices. Finally, positive or negative sentiment in the results or discussion sections may indicate biases towards the outcomes of the research. Although any one of these features may not directly express replicability, together they provide a holistic view of the writing.

Using each paragraph as input, we compute readability using the Flesch Reading Ease formula (Kincaid et al. 1975), sentiment using the AllenNLP suite (Gardner et al. 2017), and subjectivity using the TextBlob package (Loria 2018). This produces a distribution of these features over the text that we can relate to discourse, experimental results (statistics), and domain-specific extractions.
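A minimal sketch of these per-paragraph metrics follows. TextBlob's subjectivity and polarity scores are the library's real API; we substitute the textstat package for the Flesch Reading Ease computation and TextBlob polarity for the AllenNLP sentiment model to keep the example self-contained, whereas the actual pipeline uses AllenNLP for sentiment.

```python
import textstat
from textblob import TextBlob

def language_quality(paragraph: str) -> dict:
    """Readability, subjectivity, and sentiment for one paragraph."""
    blob = TextBlob(paragraph)
    return {
        # Flesch Reading Ease: higher scores indicate easier text.
        "readability": textstat.flesch_reading_ease(paragraph),
        # 0.0 is fully objective, 1.0 is fully subjective.
        "subjectivity": blob.sentiment.subjectivity,
        # Stand-in for the AllenNLP sentiment model: -1 to +1 polarity.
        "sentiment": blob.sentiment.polarity,
    }

print(language_quality("Our results clearly demonstrate a remarkable effect."))
```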
3.4 Methodological Information Extraction

The unstructured prose of scientific documents includes key features for assessing replicability, such as sample sizes, populations, conditions, experimental variables, methods, materials, exclusion criteria, and participant compensation. Much of this information is available as concise spans of text in the document: "twenty-four" may be a sample size; "undergraduates" may be a population description; "reaction time" may be a dependent variable; and so on. Consequently, we are not interested in extracting and classifying relations at this phase of analysis; rather, we optimize our information extractor to classify individual spans within the text with context-sensitive labels.

Our dataset includes 620 labeled examples that are annotated with the following properties:

• Sample count: How many elements are in the sample.
• Sample noun: Noun phrases referring to sample elements, e.g., students, participants, cases, etc.
• Sample detail: Details of the same, e.g., race, sex, age, community, university, AMT, etc.
• Compensation: How participants are compensated.
• Exclusion count: Number excluded from the sample.
• Exclusion reason: Stated reason(s) for why elements are excluded from the final sample set.
• Experiment reference: Name of or reference to an experiment within the document.
• Experimental condition: Named or unnamed control or experimental condition employed.
• Experimental variable/factor: Elements measured or reported in the document, e.g., reaction time, participant preference, accuracy on a task.
• Method or material: Experimental methods or materials employed, e.g., ANOVA, questionnaire, priming.

We extract these features using a transformer-based, token-level classifier that processes each sentence separately. The output of the classifier model is a Begin/Inside/Outside (BIO) prediction for each token in a sentence. This assumes that no labels overlap in a sentence, which is one constraint of our dataset.
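Below is a minimal sketch of such a token-level classifier using the Hugging Face transformers library with the public SciBERT checkpoint. The BIO label inventory is assembled from the span types above, reusing the abbreviated tag names from Figures 2-4 where available (the remaining abbreviations are our guesses), and the fine-tuning loop is omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO labels: B-/I- per span type plus one O ("outside") label -> 21 classes.
SPAN_TYPES = ["samp_num", "samp_noun", "samp_detail", "compensation",
              "excl_num", "excl_reason", "exper_ref", "exper_cond",
              "exper_factor", "method_or_material"]
LABELS = ["O"] + [f"{prefix}-{t}" for t in SPAN_TYPES for prefix in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=len(LABELS))

# One sentence at a time, per the sentence-level design described above.
inputs = tokenizer("Eight participants were excluded for inattention.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, n_tokens, 21)
predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
```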
We illustrate the above labels as predicted on some typical sentences from research articles in the SBS literature. Figure 2 shows our model's information extraction results for a typical statement introducing a population and sample size. It tags the English spans for the sample count "one hundred and ninety-seven," the sample noun "individuals," details (i.e., age mean, SD, gender, and AMT), an experiment reference to "this study," and the compensation of "$1." In another paper, Figure 3 identifies the number of sample elements excluded, along with the resulting sample number and gender details. Finally, Figure 4 shows a sentence from the summary of an article, tagging "sem" (Structural Equation Modeling) as a methodology, the sample noun and details, and five experimental factors that are assessed in the paper.

Figure 2: Labeling spans for sample size (samp_num), sample details (samp_detail), and subject compensation (compensation) in a specific study (exper_ref).

Figure 3: Labeling spans for the number of sample elements excluded (excl_num) and the stated reason they were excluded (excl_reason), as well as the final sample number.

Figure 4: Labeling the sample, the experimental methods employed (method_or_material), and the factors (exper_factor) under study.

Our model next processes the resulting classified spans (as shown in Figures 2, 3, and 4) to opportunistically extract domain-specific numerical and Boolean features. For example, the sample count and exclusion count are both expected to be integers, so it attempts to coerce "one hundred and ninety-seven" (Figure 2) and "Eight" (Figure 3) to integers and populate the corresponding integer features. Similarly, the model uses a lexicon-based approach over the sample descriptor spans to populate Boolean features indicating whether participants' gender, age, race, religion, and community are specified, what the recruitment pool is (e.g., AMT, universities, etc.), and how participants are compensated (e.g., course credit, monetary, etc.). These numerical, Boolean, and lexical features populate the argument structure of the paper, which we describe in subsequent sections.

We train a model by fine-tuning SciBERT and DistilBERT uncased models, and we evaluate using the same 558/62 randomized train/test split of our 620 labeled examples. Table 3 shows the results across the transformer models for 100 iterations each, with the best performance coming from the SciBERT uncased model. While our model shows favorable results for our relatively small dataset of 620 examples, we are presently extending the dataset.

Table 3: Precision/Recall/F1 results on a holdout set of information extraction examples.

    Transformer Model          Precision   Recall   F1
    distilbert uncased         0.62        0.70     0.66
    roberta base               0.59        0.64     0.61
    bert large uncased         0.61        0.71     0.66
    scibert scivocab uncased   0.67        0.74     0.70
    scibert scivocab cased     0.62        0.73     0.67

One limitation of the present sentence-level analysis is that cross-sentence coreferring expressions are unresolvable within the model, although, since we are not extracting complex relations across entities, most context-sensitive concepts such as sample size and exclusion count have ample context within the sentence itself. We plan to quantify the benefit of adding cross-sentence coreference resolution in future work.

3.5 Statistical Test Extraction

The descriptions of statistical tests in scientific documents are much more structured than the descriptions of samples, methods, and factors. Consequently, our system uses Python regular expressions (rather than a transformer-based model) to extract statistical tests, motivated by processing speed. Figure 5 shows a semantic subgraph for a cluster of extracted tests: each P-test includes an ordinal comparator and a value for p, and each F-test includes two degrees of freedom and a value. Clustering these statistical tests into subgraphs helps identify duplicate reports of experimental results, and it provides context for downstream graphical analysis and machine learning.

Figure 5: Semantic subgraph for a local cluster of two statistical tests extracted from a paper. A string such as "F(1, 26) = 12.84, p < .001" becomes a StatTest node with an FTest subtest (df_val1 = 1.0, df_val2 = 26.0, val = 12.84) and a PTest subtest (ordinal "<", val = 0.001).
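The exact regular expressions are not given in the paper; a simplified sketch that recovers the fields shown in Figure 5 (degrees of freedom, test value, ordinal, and p-value) might look like the following.

```python
import re

# F-tests such as "F(1, 26) = 12.84" and p-values such as "p < .001".
F_TEST = re.compile(r"F\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d*\.?\d+)")
P_TEST = re.compile(r"\bp\s*([<>=])\s*(\d*\.\d+)")

def extract_stat_tests(text: str) -> list:
    """Return (subtest name, fields) pairs for each test found."""
    tests = []
    for m in F_TEST.finditer(text):
        tests.append(("FTest", {"df_val1": float(m.group(1)),
                                "df_val2": float(m.group(2)),
                                "val": float(m.group(3))}))
    for m in P_TEST.finditer(text):
        tests.append(("PTest", {"ordinal": m.group(1),
                                "val": float(m.group(2))}))
    return tests

print(extract_stat_tests("F(1, 26) = 12.84, p < .001"))
# [('FTest', {'df_val1': 1.0, 'df_val2': 26.0, 'val': 12.84}),
#  ('PTest', {'ordinal': '<', 'val': 0.001})]
```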
3.6 Assembly into Argument Structure

After extracting individual spans and subgraphs from the unstructured prose of a scientific article, we assemble the extracted information into a global graph that we refer to as the argument structure of the document. As implied by its name, the argument structure is designed to express the premises, evidence, and observations in a scientific article, ultimately in support of its conclusions.

The system generates the argument structure by iterating over the sequence of text segments and associated semantic tags (see Table 2 for a list of tags). Upon encountering a transition in semantic tags, such as a new Methodology section after a Discussion section, the system instantiates a new Study node within its argument structure, and then adds the BERT-extracted features (see above) and statistical test subgraphs (see above) as constituents of the new node. In this fashion, the system accumulates nodes for Introduction, Study, and Discussion prose. A small set of features for two Study nodes from the same paper are shown in Figure 6, each linked to the sentences and paragraphs of the scientific article itself. In this fashion, the argument structure is a fully-connected graph that supports graph pattern matching, confidence propagation, and feature extraction in order to judge and explain replicability.

Figure 6: Populating argument structure for two studies using information extracted across sentences and paragraphs.
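A minimal sketch of the node-creation loop described above, assuming segments arrive as (semantic tag, extracted features, statistical tests) triples; the real argument structure is a richer, fully-connected graph, so the flat list of Study nodes here is only illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class StudyNode:
    """One Study node: extracted features plus statistical-test subgraphs."""
    features: dict = field(default_factory=dict)
    stat_tests: list = field(default_factory=list)

def assemble_studies(segments):
    """Instantiate a new Study node at each transition into Methodology,
    then attach each segment's extractions to the current node."""
    studies, current, prev_tag = [], None, None
    for tag, features, stat_tests in segments:
        if tag == "Methodology" and prev_tag != "Methodology":
            current = StudyNode()
            studies.append(current)
        if current is not None:
            current.features.update(features)
            current.stat_tests.extend(stat_tests)
        prev_tag = tag
    return studies

segments = [
    ("Methodology", {"samp_num": 197}, []),
    ("Results", {}, [("FTest", {"df_val1": 1.0, "df_val2": 26.0, "val": 12.84})]),
    ("Methodology", {"samp_num": 54}, []),   # starts a second Study node
]
print(len(assemble_studies(segments)))        # 2
```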
We select claims within and across scientific documents in our cor- 11 psychology papers from the dataset and use these as the pus. To improve human interpretation, we are working to evaluation set. We predict using experiment p-value and the produce an explainability interface for users to inspect our presence of effect size (binary). The results for the individ- extractions, predictions, and argument structure for guided ual papers are shown in Table 5. paper understanding. Table 4: The accuracy and AUC for a random forest classifier with 5000 estimators and a max depth of 3. Accuracy AUC Evaluation Set 0.90 0.89 Table 5: The individual predictions and labels for each paper in the evaluation set. The model correctly predicts 10 of the 11 papers. Paper Reference Label Prediction (Nosek, Banaji, and Greenwald 2002) Exp. 1 1 0.70 (Nosek, Banaji, and Greenwald 2002) Exp. 2 1 0.66 (Soto et al. 2008) 1 0.66 (Monin, Sawyer, and Marquez 2008) 0 0.36 (Purdie-Vaughns et al. 2008) 0 0.31 (Goff, Steele, and Davies 2008) 0 0.27 (Payne, Burkley, and Stokes 2008) 1 0.27 (Shnabel and Nadler 2008) 0 0.25 (Lemay Jr and Clark 2008a) 0 0.24 (Fischer, Greitemeyer, and Frey 2008) 0 0.23 (Lemay Jr and Clark 2008b) 0 0.06 We will further assess the validity of the elements of a Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pre- paper, i.e., the confidence that we have in the claims in the trained language model for scientific text. arXiv preprint paper given the assumptions made by it, by using an exist- arXiv:1903.10676 . ing probabilistic inference and recognition system, SUNNY, Camerer, C. F.; Dreber, A.; Forsell, E.; Ho, T.-H.; Huber, J.; originally developed for planning (Kuter et al. 2004) and Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; social network analysis (Kuter and Golbeck 2007, 2010). Chan, T.; et al. 2016. Evaluating replicability of laboratory SUNNY propagates local estimates of uncertainty through experiments in economics. Science 351(6280): 1433–1436. large models. Its most basic output is the probability, as a function of time, that a particular event will be true. We have Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, extended SUNNY for k-nearest neighbors (kNN) learning R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, and prediction capabilities, as well as a Naive Bayes diag- C.; et al. 2018. Universal sentence encoder. arXiv preprint noses of confidence scores based on the (kNN) clustering. arXiv:1803.11175 . The numeric and qualitative features in our argument struc- Chan, J.; Chang, J. C.; Hope, T.; Shahaf, D.; and Kittur, A. tures form the basis of the kNN clustering and we will ex- 2018. Solvent: A mixed initiative system for finding analo- tend these measures towards predicting replicability scores gies between research papers. Proceedings of the ACM on in SUNNY in the near future. Human-Computer Interaction 2(CSCW): 1–21. Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, Acknowledgements D. S. 2020. Specter: Document-level representation learning This material is based upon work supported by the Defense using citation-informed transformers. In Proceedings of the Advanced Research Projects Agency (DARPA) and Army 58th Annual Meeting of the Association for Computational Research Office (ARO) under Contract No. W911NF-20- Linguistics, 2270–2282. C-0002. Any opinions, findings, and conclusions or recom- mendations expressed in this material are those of the au- Dasigi, P.; Burns, G. A.; Hovy, E.; and de Waard, A. 2017. 
4 Discussion

Targeting replicability in the evaluation of research is a diverse task that is often not prioritized during peer review. Improving human comprehension of decisive factors is a crucial step towards integrating automated systems for replicability prediction into the review process. In this work, we develop an automated system for identifying, extracting, and organizing those factors. We introduce measures of language quality such as subjectivity, sentiment, and readability; we semantically tag text in order to understand language context; we extract statistical test information, linguistic relationships, and methodologies; and we then construct a hierarchical argument structure and perform replicability classification. These factors and their organization are intuitive to readers and allow for both top-down and bottom-up understanding of a paper's methods. Although leaving the review process entirely up to automation is not feasible, human-in-the-loop systems that guide reviewers through important text, factors, and predictions can reduce the number of non-replicable papers that make it through review.

5 Future Work

One of the main focuses of our future work is to extend our ground truth datasets and evaluate replicability prediction across the combinations of features that we develop in the current work. Due to the limited data size, the current evaluation set is too small to definitively select the best combination of features for replicability prediction.

We are also working to broaden our system's features and capabilities. For instance, we are incorporating a transformer-based information extractor that extracts the causal, proportional, and comparative relationships in scientific claims (Magnusson and Friedman 2021) to relate the claims within and across scientific documents in our corpus. To improve human interpretation, we are working to produce an explainability interface for users to inspect our extractions, predictions, and argument structure for guided paper understanding.

We will further assess the validity of the elements of a paper, i.e., the confidence that we have in the claims of the paper given the assumptions it makes, by using an existing probabilistic inference and recognition system, SUNNY, originally developed for planning (Kuter et al. 2004) and social network analysis (Kuter and Golbeck 2007, 2010). SUNNY propagates local estimates of uncertainty through large models. Its most basic output is the probability, as a function of time, that a particular event will be true. We have extended SUNNY with k-nearest neighbors (kNN) learning and prediction capabilities, as well as a Naive Bayes diagnosis of confidence scores based on the kNN clustering. The numeric and qualitative features in our argument structures form the basis of the kNN clustering, and we will extend these measures towards predicting replicability scores in SUNNY in the near future.

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO) under Contract No. W911NF-20-C-0002. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA) and Army Research Office (ARO).

References

Allen, J.; de Beaumont, W.; Galescu, L.; and Teng, C. M. 2015. Complex event extraction using DRUM. Technical report, Florida Institute for Human and Machine Cognition, Pensacola, United States.

Altmejd, A.; Dreber, A.; Forsell, E.; Huber, J.; Imai, T.; Johannesson, M.; Kirchler, M.; Nave, G.; and Camerer, C. 2019. Predicting the replicability of social science lab experiments. PLoS ONE 14(12).

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Camerer, C. F.; Dreber, A.; Forsell, E.; Ho, T.-H.; Huber, J.; Johannesson, M.; Kirchler, M.; Almenberg, J.; Altmejd, A.; Chan, T.; et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351(6280): 1433–1436.

Cer, D.; Yang, Y.; Kong, S.-y.; Hua, N.; Limtiaco, N.; John, R. S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Chan, J.; Chang, J. C.; Hope, T.; Shahaf, D.; and Kittur, A. 2018. Solvent: A mixed initiative system for finding analogies between research papers. Proceedings of the ACM on Human-Computer Interaction 2(CSCW): 1–21.

Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, D. S. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2270–2282.

Dasigi, P.; Burns, G. A.; Hovy, E.; and de Waard, A. 2017. Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks. arXiv preprint arXiv:1702.05398.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dreber, A.; Pfeiffer, T.; Almenberg, J.; Isaksson, S.; Wilson, B.; Chen, Y.; Nosek, B. A.; and Johannesson, M. 2015. Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences 112(50): 15343–15347.

Fischer, P.; Greitemeyer, T.; and Frey, D. 2008. Self-regulation and selective exposure: The impact of depleted self-regulation resources on confirmatory information processing. Journal of Personality and Social Psychology 94(3): 382.

Friedman, S.; Burstein, M.; McDonald, D.; Plotnick, A.; Bobrow, L.; Bobrow, R.; Cochran, B.; and Pustejovsky, J. 2017. Learning by reading: Extending and localizing against a model. Advances in Cognitive Systems 5: 77–96.

Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N. F.; Peters, M.; Schmitz, M.; and Zettlemoyer, L. S. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform.

Goff, P. A.; Steele, C. M.; and Davies, P. G. 2008. The space between us: Stereotype threat and distance in interracial contexts. Journal of Personality and Social Psychology 94(1): 91.

Hakala, K.; and Pyysalo, S. 2019. Biomedical named entity recognition with multilingual BERT. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 56–61.

Head, M. L.; Holman, L.; Lanfear, R.; Kahn, A. T.; and Jennions, M. D. 2015. The extent and consequences of p-hacking in science. PLoS Biology 13(3): e1002106.
Huber, P.; and Carenini, G. 2019. Predicting discourse structure using distant supervision from sentiment. arXiv preprint arXiv:1910.14176.

Kincaid, J. P.; Fishburne Jr, R. P.; Rogers, R. L.; and Chissom, B. S. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command, Millington, TN, Research Branch.

Klein, R. A.; Ratliff, K. A.; Vianello, M.; Adams Jr, R. B.; Bahník, Š.; Bernstein, M. J.; Bocian, K.; Brandt, M. J.; Brooks, B.; Brumbaugh, C. C.; et al. 2014. Investigating variation in replicability. Social Psychology.

Kuter, U.; and Golbeck, J. 2007. SUNNY: A new algorithm for trust inference in social networks using probabilistic confidence models. In AAAI.

Kuter, U.; and Golbeck, J. 2010. Using Probabilistic Confidence Models for Trust Inference in Web-Based Social Networks. Transactions on Internet Technology (TOIT) 7: 1377–1382.

Kuter, U.; Nau, D.; Gossink, D.; and Lemmer, J. F. 2004. Interactive Course-of-Action Planning Using Causal Models. In International Conference on Knowledge Systems for Coalition Operations (KSCO-2004), 37–52.

Lemay Jr, E. P.; and Clark, M. S. 2008a. "Walking on eggshells": How expressing relationship insecurities perpetuates them. Journal of Personality and Social Psychology 95(2): 420.

Lemay Jr, E. P.; and Clark, M. S. 2008b. How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. Journal of Personality and Social Psychology 94(4): 647.

Loria, S. 2018. textblob Documentation. Release 0.15.2.

Magnusson, I. H.; and Friedman, S. E. 2021. Graph Knowledge Extraction of Causal, Comparative, Predictive, and Proportional Associations in Scientific Claims with a Transformer-Based Model. In AAAI Workshop on Scientific Document Understanding.

Monin, B.; Sawyer, P. J.; and Marquez, M. J. 2008. The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology 95(1): 76.

Nosek, B. A.; Banaji, M. R.; and Greenwald, A. G. 2002. Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology 83(1): 44.

Open Science Collaboration; et al. 2015. Estimating the reproducibility of psychological science. Science 349(6251).

Payne, B. K.; Burkley, M. A.; and Stokes, M. B. 2008. Why do implicit and explicit attitude tests diverge? The role of structural fit. Journal of Personality and Social Psychology 94(1): 16.

Plavén-Sigray, P.; Matheson, G. J.; Schiffler, B. C.; and Thompson, W. H. 2017. The readability of scientific texts is decreasing over time. eLife 6: e27725.

Purdie-Vaughns, V.; Steele, C. M.; Davies, P. G.; Ditlmann, R.; and Crosby, J. R. 2008. Social identity contingencies: How diversity cues signal threat or safety for African Americans in mainstream institutions. Journal of Personality and Social Psychology 94(4): 615.

Shnabel, N.; and Nadler, A. 2008. A needs-based model of reconciliation: Satisfying the differential emotional needs of victim and perpetrator as a key to promoting reconciliation. Journal of Personality and Social Psychology 94(1): 116.

Soto, C. J.; John, O. P.; Gosling, S. D.; and Potter, J. 2008. The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. Journal of Personality and Social Psychology 94(4): 718.

Valenzuela-Escárcega, M. A.; Babur, Ö.; Hahn-Powell, G.; Bell, D.; Hicks, T.; Noriega-Atala, E.; Wang, X.; Surdeanu, M.; Demir, E.; and Morrison, C. T. 2018. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database 2018.

Waldie, B. 2009. Automator for Mac OS X 10.6 Snow Leopard: Visual QuickStart Guide.

Yang, Y.; Youyou, W.; and Uzzi, B. 2020. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proceedings of the National Academy of Sciences 117(20): 10762–10768.