<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A survey of retracted articles in dentistry. BMC
Research Notes</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1756-0500</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/s13104</article-id>
      <title-group>
        <article-title>Understanding and Predicting Retractions of Published Work</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sai Ajay Modukuri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rajtmajer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Cinzia Squicciarini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>C. Lee Giles</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Old Dominion University</institution>
          ,
          <addr-line>Norfolk, VA 23529</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Pennsylvania State University</institution>
          ,
          <addr-line>University Park, PA 16802</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>10</volume>
      <issue>1</issue>
      <fpage>819</fpage>
      <lpage>822</lpage>
      <abstract>
        <p>Recent increases in the number of retractions of published papers reflect heightened attention and increased scrutiny in the scientific process motivated, in part, by the replication crisis. These trends motivate computational tools for understanding and assessment of the scholarly record. Here, we sketch the landscape of retracted papers in the Retraction Watch database, a collection of 19k records of published scholarly articles that have been retracted for various reasons (e.g., plagiarism, data error). Using metadata as well as features derived from full-text for a subset of retracted papers in the social and behavioral sciences, we develop a random forest classifier to predict retraction in new samples with 73% accuracy and F1-score of 71%. We believe this study to be the first of its kind to demonstrate the utility of machine learning as a tool for the assessment of retracted work.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The last two decades have seen growing concern in the
scientific community about the integrity of published work
        <xref ref-type="bibr" rid="ref12 ref17 ref33">(Collaboration et al. 2015; Camerer et al. 2018; Klein et al. 2018)</xref>
        and an increase in the number of retractions of published
articles (see Figure 1), in part due to increased scrutiny and
improved oversight
        <xref ref-type="bibr" rid="ref10 ref24 ref9">(Steen, Casadevall, and Fang 2013; Fanelli
2013; Brainard 2018)</xref>
        . Focused studies of the primary
reasons for retraction have suggested that research misconduct
and fraud make up the majority, but also that a sizeable
number of retractions are due to laboratory error, error in
analyses, or failure to reproduce or replicate
        <xref ref-type="bibr" rid="ref13 ref27 ref34">(Casadevall, Steen, and Fang 2014; Hesselmann et al. 2017)</xref>
        .
      </p>
      <p>
        Continued attention to and assessment of our confidence
in published work is a cornerstone of efficient scientific
progress, yet the sheer volume of research papers
published each year is overwhelming and increasing
        <xref ref-type="bibr" rid="ref8">(Bornmann
and Mutz 2015)</xref>
        . Auditors and stakeholders, including
reviewers, editors, other scientists, and the broader public, seek
indicators and tools to contextualize and evaluate published
findings, but these processes are still largely ad hoc. Proxies
for credibility, such as citations and impact factors, while
widespread, have also been shown to be biased and flawed
        <xref ref-type="bibr" rid="ref25 ref50 ref7">(Garfield et al. 1994; Seglen 1997; Bordons, Fernández, and
Gómez 2002)</xref>
        . Leading voices have argued for a re-imagining
of scholarship itself
        <xref ref-type="bibr" rid="ref46">(Stodden et al. 2016; Perkel 2018)</xref>
        in
support of greater transparency and verifiability. While it is still
unclear what form these tools will take, it is clear that
computational tools will play a role in aggregating, sorting, querying,
and evaluating scientific outputs in the future. Our work is
motivated by this view, as we put forward a supervised
approach to determine the factors that best predict the retraction of
scholarly work.
      </p>
      <p>
        Here, we study retractions collected by The Center for
Scientific Integrity and included in its Retraction Watch database
(retractionwatch.com;
        <xref ref-type="bibr" rid="ref45">(Oransky and Marcus 2012)</xref>
        ). We
extract a combination of metadata and full-text features that can
separate retracted from non-retracted papers and develop a
classifier to predict retraction in new samples with relatively
high confidence. We focus on research publications in the
social and behavioral sciences in this study, as it is not yet clear
whether and how different research cultures and publishing
norms differentially impact retraction across fields. Our
contributions are as follows:
• Extract meaningful information from retracted papers in
the social and behavioral sciences, as well as from a
complement set of non-retracted papers, from both metadata
and full-text.
• Build a binary classifier that identifies the likelihood of a
paper’s retraction from the extracted information, with 73%
accuracy.
• Identify, through ablation studies, features and sets of features
that best separate retracted from non-retracted papers.
These insights, we argue, can direct further research into
automated tools for assigning confidence in publication
claims.
      </p>
      <p>The next section highlights related work in the area of
understanding the retraction of scientific publications. Section
3 sketches our primary dataset and preprocessing pipeline.
Section 4 outlines our features pulled from metadata and
full-text documents. Sections 5 and 6 detail our classification
approach and ablation studies. We conclude with a
discussion of our findings and implications for ongoing and future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Several studies have explored the retracted literature within
a specific field of interest.
        <xref ref-type="bibr" rid="ref4">(Bennett et al. 2020)</xref>
        analyses
retracted papers in the obstetrics literature using the Retraction
Watch database and PubMed. They present a breakdown of
various metrics in that dataset, including journal impact
factors, reasons for retractions, number of citations received,
h-index of authors, and type of articles. Other authors have
engaged in similar discussions across a variety of fields,
including chemistry and materials science
        <xref ref-type="bibr" rid="ref18">(Coudert 2019)</xref>
        ,
biomedical sciences
        <xref ref-type="bibr" rid="ref22">(Dal-Ré 2019)</xref>
        , dentistry (Nogueira et al. 2017)
and oncology
        <xref ref-type="bibr" rid="ref45">(Pantziarka and Meheus 2019)</xref>
        . One recent
paper
        <xref ref-type="bibr" rid="ref41 ref45">(Mistry, Grey, and Bolland 2019)</xref>
        surveys publication
rates after the first retraction for biomedical researchers with
multiple retracted publications. The study finds that
publication rates of authors with multiple retractions, most of whom
were associated with scientific misconduct, declined rapidly
after their first retraction, but a small minority continued to
publish regularly. Similarly,
        <xref ref-type="bibr" rid="ref42 ref45">(Mott, Fairhurst, and Torgerson
2019; Suelzer et al. 2019)</xref>
        also found a decline in number of
citations after retraction.
      </p>
      <p>
        Other work supplements data-driven findings from the
analysis of retracted papers in the literature with suggestions
for the community. Authors of
        <xref ref-type="bibr" rid="ref14">(Chan, Jones, and Albarracín
2017)</xref>
        highlight so-called continued influence effects, or the
tendency of false beliefs to persist after correction and
retraction, supporting their discussion through analysis of citations
of retracted papers in downstream research articles. Their
work puts forward a set of best practices for science
communication scholars and practitioners. Meanwhile,
        <xref ref-type="bibr" rid="ref19 ref23">(Dal-Ré et al.
2020)</xref>
        analyses retractions due to conflicts of interest and
argues for greater transparency on the part of both journals and
authors in disclosing financial interests.
      </p>
      <p>
        More closely related to our work, two very recent papers
have begun to suggest possible indicators of low
credibility work.
        <xref ref-type="bibr" rid="ref30">(Horton, Krishna Kumar, and Wood 2020)</xref>
        suggests that Benford’s law can be used to differentiate retracted
academic papers that have employed fraudulent/manipulated
data from other academic papers that have not been retracted.
Specifically, the authors construct several Benford
conformity measures based on the first significant digits contained
in the articles and show deviation for 37 papers containing
known academic fraud. Supporting a broader conversation
about open science and the role of transparency in scientific
processes,
        <xref ref-type="bibr" rid="ref35 ref45">(Lesk, Mattern, and Sandy 2019)</xref>
        study retraction
rates in work with associated shared datasets. The authors found
that published work with open data has fewer retractions,
signaling higher credibility.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Countries contributing the most retractions in the dataset.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Country</th><th>Count</th></tr>
          </thead>
          <tbody>
            <tr><td>China</td><td>3,211</td></tr>
            <tr><td>United States</td><td>1,462</td></tr>
            <tr><td>Japan</td><td>460</td></tr>
            <tr><td>India</td><td>392</td></tr>
            <tr><td>Germany</td><td>314</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Finally, with the recent outbreak of COVID-19 (SARS-CoV-2)
and a flurry of scientific output related to the pandemic,
the scientific community has also faced a surge in the number
of retractions of publications related to COVID-19. Work by
        <xref ref-type="bibr" rid="ref23">Dinis-Oliveira (2020)</xref>
        and Soltani and Patini (2020) studies retractions related to
COVID-19 and highlights the need for better scrutiny of
published papers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        At the time of writing, the Retraction Watch database
        <xref ref-type="bibr" rid="ref45">(Oransky and Marcus 2012)</xref>
        has 19,864 records of retracted papers.
Our analysis considered 18,970 records in the dataset from
the year 2001 to 2019. We further downselected 8,087
retractions in the social sciences for classification. Specifically, our
classification task considered papers tagged by the Retraction
Watch organization relating to the following subjects: Health
Sciences (HSC, 5,396 papers), Social Sciences (SOC, 2,651
papers), and Humanities (HUM, 366 papers). More than one
subject may be listed for a given paper.
      </p>
      <p>Each record in the database includes a rich collection of
metadata, including: ‘Title’, ‘Subject’, ‘Institution’,
‘Journal’, ‘Publisher’, ‘Country’, ‘Author’, ‘URLS’, ‘ArticleType’,
‘RetractionDate’, ‘RetractionDOI’, ‘RetractionPubMedID’,
‘OriginalPaperDate’, ‘OriginalPaperDOI’,
‘OriginalPaperPubMedID’, ‘RetractionNature’, ‘Reason’, ‘Paywalled’.</p>
      <p>Approximately 72% of the 8,087 retractions in our dataset
originate from one of five countries (see Table 1). China
contributed 39.7% of the total retractions, followed by the
United States at 18%.</p>
      <p>
        For a majority of articles, limited to no information about
the reason for retraction is available in the dataset. In cases
where that information is given, investigation by external
parties such as journals, institutions, companies, etc.,
contribute to 27.7% of retractions. Malpractices such as
plagiarism, duplication, falsification, fabrication, and manipulation
of data represent 37.3% (most malpractice is determined as
the result of an investigation). Other prevalent reasons for
retraction include breach of policy by authors, withdrawals
by authors, and author misconduct (see Table 2). Of the 27,471
authors appearing in the dataset, 500 contribute to 3,863 (of
8,087) retractions. Eighty-five authors have ten or more
retractions. This trend echoes similar findings reported
in
        <xref ref-type="bibr" rid="ref10 ref9">(Brainard and You 2018)</xref>
        .
      </p>
      <p>The average time from date of publication to date of
retraction in our dataset is 2 years. However, retraction time varies
by subject. Average retraction time is 2.7 years for papers
in HSC, as compared to 0.8 years in SOC and 1.7 years in
HUM. We also observe significant variation in the
distribution of reasons for retraction across subjects. For example,
retractions due to limited or no information contributed to
69% of retractions in SOC; the same reason contributed to
only 14% of retractions in HSC. Similar observations were
drawn in a study of retractions in the surgical literature
        <xref ref-type="bibr" rid="ref31">(King
et al. 2018)</xref>
        .
      </p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Most common reasons for retraction: limited or no information;
investigation by journal/publisher; investigation by
company/institution; duplication of article; withdrawal by author.</p>
        </caption>
      </table-wrap>
      <sec id="sec-3-1">
        <title>Dataset for Classification</title>
        <p>Of the 8,087 records, we further restricted our
attention to records with entries in PubMed, because the abstracts
and MeSH terms available from PubMed can be used to search for
comparable negative samples. Of these, we focus on records for which
we can collect full-texts. This leaves 4,550 positive samples,
along with their full-texts, for the classification task.</p>
        <sec id="sec-3-1-1">
          <title>Negative Samples Collection</title>
          <p>For classifier development and testing, a comparable set of
non-retracted published articles (negative training samples)
in a one-to-one mapping with retracted articles was collected
such that:
• The negative sample was published within 3 years (before
or after) the year of publication of the retracted sample.
• The negative sample most closely matches the retracted
sample based on keywords (see below for details).</p>
          <p>
            Keywords were extracted from papers using the TextRank
algorithm proposed in
            <xref ref-type="bibr" rid="ref40">(Mihalcea and Tarau 2004)</xref>
            . TextRank
uses a graph-based ranking model, which can be effectively
used to extract keywords from text without the need for
domain knowledge or annotated corpora. Extracted keywords
were used to search for papers on similar topics around the
same year of publication using the PubMed Entrez API
(https://www.ncbi.nlm.nih.gov/home/develop/api/). The
paper selected as the top match to each retracted paper,
published within the three-year time window, was selected for
inclusion in the negative training set. With the collected
negative and positive samples, our final dataset has 8,744
records.
          </p>
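          <p>A minimal sketch of the TextRank idea: build a co-occurrence graph over words and rank nodes by a plain power-iteration PageRank. The full algorithm additionally filters candidate words by part of speech; the length filter below is a crude stand-in for that step.</p>
          <preformat>
```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, top_k=3, iters=50, d=0.85):
    """Rank words by PageRank over a co-occurrence graph (TextRank sketch)."""
    words = re.findall(r"[a-z]+", text.lower())
    # Crude stand-in for POS filtering: drop very short tokens.
    words = [w for w in words if len(w) > 3]
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    nodes = list(neighbors)
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        # Standard PageRank update with damping factor d.
        score = {w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                      for u in neighbors[w])
                 for w in nodes}
    ranked = sorted(score.items(), key=lambda kv: -kv[1])
    return [w for w, _ in ranked[:top_k]]
```
          </preformat>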
        </sec>
        <sec id="sec-3-1-2">
          <title>Preprocessing of full-texts</title>
          <p>
            For both the records from Retraction Watch and the records
selected from PubMed, we collected and preprocessed
full-text PDFs. We experimented with several available
conversion tools. While pdftotext
(https://www.xpdfreader.com/pdftotext-man.html) worked well
for PDF-to-text conversion, it did not structure output in a usable way.
Instead, to extract data from articles in a structured format, we
used GROBID (GeneRation Of BIbliographic Data)
            <xref ref-type="bibr" rid="ref36">(Lopez
2009)</xref>
            , which can segment PDF papers into TEI format,
allowing programmatic access to various fields and sections of
the paper. The GROBID output is further parsed using
regular expression patterns for downstream feature
extraction tasks.
          </p>
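          <p>The TEI output can be queried with standard XML tooling. Below is a sketch using Python's ElementTree; the tiny TEI tree is built by hand here as a stand-in for GROBID's actual output.</p>
          <preformat>
```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

def extract_header(tei_root):
    """Pull title and abstract text from a GROBID-style TEI tree."""
    title = tei_root.find(".//tei:titleStmt/tei:title", NS)
    abstract = tei_root.find(".//tei:abstract", NS)
    return {
        "title": title.text if title is not None else None,
        "abstract": " ".join(abstract.itertext()).strip()
        if abstract is not None else None,
    }

# Hand-built stand-in for a real TEI document from GROBID.
root = ET.Element("{%s}TEI" % TEI)
header = ET.SubElement(root, "{%s}teiHeader" % TEI)
file_desc = ET.SubElement(header, "{%s}fileDesc" % TEI)
stmt = ET.SubElement(file_desc, "{%s}titleStmt" % TEI)
ET.SubElement(stmt, "{%s}title" % TEI).text = "Sample Paper"
profile = ET.SubElement(header, "{%s}profileDesc" % TEI)
abstract = ET.SubElement(profile, "{%s}abstract" % TEI)
ET.SubElement(abstract, "{%s}p" % TEI).text = "We study retractions."
```
          </preformat>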
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Features</title>
      <p>We use a comprehensive set of features, including publication
metadata and features derived from the full-text of published
papers. Metadata features are pulled through public scholarly
APIs, and we make use of various mining tools, including
GROBID and pdftotext, to extract pertinent information from
full-text PDFs of published articles.</p>
      <sec id="sec-4-1">
        <title>Metadata features</title>
        <p>We leverage the Scopus, Crossref, and Semantic Scholar
datasets and tools to collect key measures related to the
papers in our dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Lead author university rankings</title>
        <p>GROBID output includes the author’s first and last names and institutional
affiliations. We use this information when available. When
missing, we search for authors’ affiliation information through
Elsevier API. We then augment the first author’s affiliation
with an affiliation score, calculated using institutional
rankings from Times Higher Education6 as follows:</p>
        <p>Affiliation Score =</p>
      </sec>
      <sec id="sec-4-3">
        <title>Citation and Reference Intents</title>
        <p>Semantic Scholar also provides the intent behind each citation and reference. A
paper can be cited as background, methodology, results, etc.
For a given paper, we count the number of citing papers of
certain intent(s) by querying the paper’s identifiers (title or
DOI) against the Semantic Scholar API. Similarly, we count
the number of references for each intent of the given paper
and use them as features.</p>
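        <p>Counting citations per intent reduces to a tally over the returned records. A sketch, assuming records shaped like the Semantic Scholar Graph API's citation entries; the intents field name and value strings are assumptions.</p>
        <preformat>
```python
from collections import Counter

def count_citation_intents(citing_records):
    """Tally citing papers by intent; a paper may carry several intents."""
    counts = Counter()
    for record in citing_records:
        # Records without an "intents" field contribute nothing.
        for intent in record.get("intents", []):
            counts[intent] += 1
    return counts
```
        </preformat>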
        <p>Open access The open-access feature indicates whether
the article can be accessed by any individual without a
paywall. We collect this information from the Elsevier API and
encode this flag as a binary feature.</p>
        <p>Other Features In addition to the features outlined above,
we use other readily available standard metadata, including:
(i) the subject area in which the paper is published; (ii) the country of the
primary authors’ affiliations; (iii) the number of references;
(iv) the number of authors; and (v) the title (we concatenate the title
with the abstract).</p>
      </sec>
      <sec id="sec-4-4">
        <title>Full-Text Features</title>
        <p>
          While metadata features give an overview of the paper,
full-text features are much more
content-specific. Specifically, we extract test statistics of experiments
from full-text. These features are extracted using PDF
conversion tools followed by various downstream feature-extraction
tasks.
p-values The p-value signifies the confidence level of a null
hypothesis based on experiments. Full-texts of published
work can be mined to extract p-values and various other
test statistics. For this, we use pdftotext to extract textual
information present in full-text PDFs. Most papers in the
social and behavioral sciences (SBS) follow standard formats
to report p-values. For example, p-values are reported as
p &lt; .001, p = .01, or p &lt; .05, etc. We follow methods similar to (
          <xref ref-type="bibr" rid="ref44">Nuĳten et al.
2016</xref>
          ) to extract p-values using various regex patterns.
        </p>
        <p>Furthermore, we derive other features from the p-values
identified by the regex patterns: the number of
p-values; real-p, defined as the lowest p-value among all the
extracted p-values; sign-p ∈ {&lt;, &gt;, =}, defined as the sign of
the real-p; and p-value range, defined as the difference between
the highest and lowest p-values extracted from the text. Some
scholarly works report p-values along with test statistics
such as ANOVA, Chi-squared, etc. We use a binary feature,
extended-p, that indicates whether a p-value is reported along
with a test statistic, for example, a report of the form
(200) = 1.38, p = .01.</p>
        <p>We use the number of p-values with a test statistic and
the number of p-values without a test statistic as features. In
the following sections, we refer to all the above p-value related
features collectively as p-value features rather than referring
to them individually.</p>
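        <p>A sketch of the regex-based extraction and the derived summary features described above, assuming one simple reporting style; the real extraction uses a larger family of patterns.</p>
        <preformat>
```python
import re

# "p" followed by a sign and a decimal value.
# chr(60) is the less-than sign, written this way to keep the
# surrounding XML document well-formed.
LT = chr(60)
P_VALUE_RE = re.compile(r"\bp\s*([" + LT + r">=])\s*(0?\.\d+)",
                        re.IGNORECASE)

def p_value_features(text):
    """Extract p-values and summarize them: count, real-p (the lowest
    value), its sign, and the range between highest and lowest values."""
    found = [(sign, float(val)) for sign, val in P_VALUE_RE.findall(text)]
    if not found:
        return {"count": 0, "real_p": None, "sign_p": None, "range": None}
    values = [v for _, v in found]
    lowest = min(values)
    sign = next(s for s, v in found if v == lowest)
    return {"count": len(found), "real_p": lowest, "sign_p": sign,
            "range": max(values) - lowest}
```
        </preformat>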
        <p>Sample Size Sample size is the number of observations
made to determine the statistical significance of a hypothesis.
Similar to ?-value extraction, sample size can be extracted
from a published article using regex patterns. In cases where
test statistics are given, sample sizes can be calculated using
various formulas based on the test statistic used. We use a
combination of regex patterns and test statistic related
formulas to extract sample sizes from a given paper.
Acknowledgements The acknowledgment section of a
published paper may contain funding information. We use
AckExtract to extract named entities using
state-of-the-art Named Entity Recognition techniques, followed by a
relation-based entity classifier to determine whether the work was
funded by an organization (Wu et al. 2020).</p>
        <p>
          Self Citations Self-citation is common practice within the
scientific community. Authors may cite their earlier works.
The effects of self-citations and their significance for a
paper’s impact factor have been extensively studied
          <xref ref-type="bibr" rid="ref40">(Renata
1977; Wolfgang, Bart, and Balázs 2004)</xref>
          . Authors
publishing in high-impact journals have more self-citations when
compared with authors usually publishing in lower-impact
journals
          <xref ref-type="bibr" rid="ref2">(Anseel et al. 2004)</xref>
          . However, when self-citation
ratios are considered, the authors observe that high-impact journals have
lower self-citation ratios than lower-impact
journals. We extract self-citations from the references
section of full-text by matching author names and calculate the
self-citation ratio. For matching, we used author names in
the title section to compare with the author names in the
references section using a fuzzy string matcher.
        </p>
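        <p>The self-citation count can be sketched with Python's difflib as the fuzzy matcher; the similarity threshold of 0.85 is an assumption, not a value reported in the paper.</p>
        <preformat>
```python
from difflib import SequenceMatcher

def is_same_author(a, b, threshold=0.85):
    """Fuzzy-match two author name strings (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def self_citation_ratio(paper_authors, reference_author_lists):
    """Fraction of references sharing at least one author with the paper."""
    if not reference_author_lists:
        return 0.0
    self_cites = sum(
        1 for ref_authors in reference_author_lists
        if any(is_same_author(a, r)
               for a in paper_authors for r in ref_authors)
    )
    return self_cites / len(reference_author_lists)
```
        </preformat>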
        <p>Abstract The abstract section provides an overview of
what the article is about and its area of study. Capturing
the abstract information in a meaningful and effective way
as a feature can play an important role in the classification
task. In this work, we have experimented with various word
embeddings to represent abstracts.</p>
        <p>
Doc2Vec Embeddings: Sentence embeddings learned
via distributed representations have proven effective
in sentence classification tasks
          <xref ref-type="bibr" rid="ref34">(Le and Mikolov 2014)</xref>
          .
Here, we experiment with these embeddings, available as
Doc2Vec in the Gensim library
          <xref ref-type="bibr" rid="ref48">(Řehůřek and Sojka 2010)</xref>
          .
BioSentVec embeddings: Along with Doc2Vec
embeddings, we also experiment with the BioSentVec embeddings
proposed by
          <xref ref-type="bibr" rid="ref15 ref45">(Chen, Peng, and Lu 2019)</xref>
          . BioSentVec is
trained on a large corpus of scholarly articles
available in the PubMed database and clinical notes from the
MIMIC-III Clinical Database. The abstracts in our
classification task come from a distribution similar to the one on
which BioSentVec is trained (since all the records in our dataset
are available in PubMed).
        </p>
        <p>
          SciBERT embeddings: Bidirectional transformers have
achieved state-of-the-art results on most NLP tasks,
including sentence classification. We experiment with
sentence embeddings from SciBERT
          <xref ref-type="bibr" rid="ref3 ref45">(Beltagy, Lo, and Cohan
2019)</xref>
          , obtained via bidirectional transformers
trained on a large corpus of scholarly articles from
Semantic Scholar. In our experiments, we use the [CLS] token
embedding from SciBERT’s output. In cases where the
abstract exceeded 512 tokens, we truncated the extra tokens
before computing embeddings.
        </p>
        <p>TFIDF: Term Frequency-Inverse Document Frequency
(TFIDF) is a popular technique in information retrieval
and machine learning. In our experiments, we use the TFIDF
of abstracts with stop words removed and words stemmed.
We also use TFIDF with reduced
dimensions using TruncatedSVD
          <xref ref-type="bibr" rid="ref26">(Halko, Martinsson, and Tropp
2011)</xref>
          .</p>
        <fig id="fig3">
          <label>Figure 3</label>
          <caption>
            <p>(a) ROC curve; (b) confusion matrix.</p>
          </caption>
        </fig>
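        <p>The TFIDF weighting itself is simple to sketch; a stdlib-only version follows, whereas in practice a library implementation (for example, scikit-learn's TfidfVectorizer followed by TruncatedSVD for dimensionality reduction) would be used.</p>
        <preformat>
```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {term: weight} dict using raw term
    frequency times log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                        for t in tf})
    return vectors
```
        </preformat>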
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Classification</title>
      <p>We formulate the task of retraction classification as
follows: given access to a labeled set of training samples
{(x_i, y_i)}_{i=1}^{N} ⊆ X_train × Y_train, such that x_i ∈ ℝ^n and
y_i ∈ {0, 1}, with y_i = 0 if the paper is retracted and y_i = 1 if it is
non-retracted, we aim to train a classifier f : X → Y with minimum
classification error on unseen data, i.e., X_test, Y_test.</p>
      <sec id="sec-5-1">
        <title>Results</title>
        <p>
          We use a random forest classifier
          <xref ref-type="bibr" rid="ref11">(Breiman 2001)</xref>
          for its
interpretability and good performance. All of our
experiments used 100 trees, as we did not see much
performance improvement beyond 100 trees. For the experiments
in Table 3, we used TFIDF to represent abstracts. Note
that we concatenate the title of the paper with the
abstract as a single feature. To further simplify the model for
interpretability, we decompose the TFIDF matrix using
randomized SVD
          <xref ref-type="bibr" rid="ref26">(Halko, Martinsson, and Tropp 2011)</xref>
          with 10
iterations down to 15 dimensions. Randomized SVD is better suited
to sparse matrices such as TFIDF. (We also experimented
with PCA for dimensionality reduction, but
randomized SVD gave better results.) For
categorical variables in our dataset, i.e., Subject and Country, we
use target encoding
          <xref ref-type="bibr" rid="ref38">(Micci-Barreca 2001)</xref>
          . The target encoder
takes into account the posterior probability of the target
given a categorical value and the prior probability of the
target over the entire training set to encode categorical variables.
We report 10-fold cross-validation scores and scores on a
train-test split (85%-15%); see Table 3. For the train-test
split, we report an Area Under the Receiver Operating
Characteristic curve (AUROC) of 78.1%. The ROC curve and a heat map
of the confusion matrix are provided in Figure 3.
        </p>
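        <p>Target encoding as described can be sketched as follows. The frequency-based blending weight with a fixed smoothing constant is one common variant of Micci-Barreca's scheme; the constant used here is an assumption, not a value from the paper.</p>
        <preformat>
```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Blend each category's posterior target mean with the global
    prior, weighted by how often the category occurs."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoding = {}
    for c in counts:
        n = counts[c]
        posterior = sums[c] / n
        weight = n / (n + smoothing)  # rare categories lean on the prior
        encoding[c] = weight * posterior + (1 - weight) * prior
    return [encoding[c] for c in categories]
```
        </preformat>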
        <p>A closer look at individual decision trees of our random
forest reveals several interesting insights. For example,
certain combinations of countries and subjects, combined with
other underlying feature values such as low SJR and
university rank, are more prone to retraction. On the other
hand, certain combinations of countries and subjects with a
high self-citation ratio are less likely to be retracted. This can
be observed in Figure 2. For better
visualization, we considered only three features (Countries, Subjects,
and SJR) and limited the depth of the decision tree to 3. These
three features together give an F1 score of 66%. Countries
and Subjects are one-hot encoded for ease of understanding,
as opposed to target encoded as for the scores in Table 3.</p>
        <p>
          We further visualize a sample whose actual and predicted
label is non-retracted using Local Interpretable
Model-Agnostic Explanations (LIME)
          <xref ref-type="bibr" rid="ref49">(Ribeiro, Singh, and Guestrin
2016)</xref>
          to present the effectiveness of our classifier. LIME
explains an individual prediction by perturbing a sample and
observing how the prediction changes around the given
sample’s perturbations. From Figure 4, we can observe that the
non-retracted sample has more than seven authors, with the
primary author’s affiliation located in Italy. The paper has
more than 40 references, was cited more than once as
methodology, and has a self-citation ratio greater than 0.17. The
SJR score of the journal where the paper is published falls
in the interval [0.62, 1.2]. All these attributes contributed
towards non-retracted classification confidence. While the
overall prediction is non-retracted, having no funding agency
acknowledged, no sample size information, and a citation_next
value greater than eight are seen as attributes that could lead
to retraction. Note that this visual analysis is particular to one
sample and does not represent global feature importance;
it is meant to give high-level intuition of how various features
can meaningfully impact confidence in a published work.
        </p>
        <p>
          We completed an ablation study to identify features (or
combinations of features) that are instrumental in identifying
retracted papers. Table 4 shows the result of this
investigation. Metadata features alone give an F1 score of 67%,
while full-text features alone result in an F1 score of 63%.
Combined, metadata and full-text features
improve performance to an F1 score of 71%. The importance of
full-text features can also be observed by excluding the abstract,
self-citation, and p-value features individually. Excluding the
abstract, self-citation ratio, or p-value features separately does not
lead to a significant drop in F1-score, but excluding them together
drops the F1-score to 67.7%.
        </p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Accuracy, precision, recall, and F1 for individual features
(Cite. Next, Uni. Rank, Open Access, p-value features, Author Cnt.,
Self-Cite., Abstract, Funded, Subject, Country), for the overall
feature sets (metadata, full-text, all features), and with particular
features excluded.</p>
          </caption>
        </table-wrap>
        <p>We examine the importance of each feature by excluding
each from the overall features and also measuring the
performance of each feature individually. In Table 4, the country of
the primary author has the most predictive power. Excluding
the country from the overall feature list hurts the F1-score
significantly. Individually, SJR, abstract, country give the
best performance out of all metadata features. Similarly, the
TFIDF of the abstracts gives the best performance of all the
full-text features. We reduced the dimension of the TFIDF
vector from 34,000 to 15 using Truncated SVD without a
significant drop in performance. The best score is achieved
by using all the features.</p>
        <p>With regard to individual features, we note from Table 4 that
features such as self-citation alone cannot achieve any
separability. However, when combined with other features, they
provide predictive power to the classifier (Figure 2).
University rank individually provides almost no separability. The
university rank of 8,535 records is set to the default value of 0;
this suggests exploring better methods to encode affiliation
information. 3,130 records in our dataset have open access
(open-access flag set to 1), and this feature exhibits almost zero
correlation (0.0017) with the retracted vs. non-retracted label.
This suggests that the open-access status of a published article is not an
indicator of a scholarly work’s confidence.</p>
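        <p>A check like the open-access one above can be sketched as a Pearson correlation between a 0/1 feature and the 0/1 retraction label. The data below is synthetic and independent by construction, so the correlation comes out near zero; the variable names are illustrative, not the paper's code.</p>

```python
# Correlating a binary feature with a binary label (Pearson r via corrcoef).
import numpy as np

rng = np.random.default_rng(1)
open_access = rng.integers(0, 2, size=1000)  # 1 = open access (synthetic)
retracted = rng.integers(0, 2, size=1000)    # 1 = retracted (independent here)

r = np.corrcoef(open_access, retracted)[0, 1]
print(round(r, 3))  # near zero: the flag carries almost no signal
```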
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this work, we present initial evidence for the utility of
supervised approaches for the assessment of retracted
scholarly work. Using metadata as well as features derived from
the full text for a subset of retracted papers in the social
and behavioral sciences, we develop a random forest
classifier to predict retraction in new samples. Looking ahead,
we might expect that signals of credibility and concern
will vary across scientific domains, and that further studies
in ML-enabled understanding of retraction will therefore
likely need to be undertaken by interdisciplinary teams. We
suggest that more sophisticated features capturing
argument structure, experimental conditions, and corroboration
across the literature will be important steps for work in this
direction.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the Defense
Advanced Research Projects Agency cooperative agreement No.
W911NF-19-2-0272. The ideas in this paper do not
necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.</p>
      <p>Tagliacozzo, R. 1977. Self-citations in scientific literature. Journal of Documentation 33(4): 251–265. ISSN 0022-0418. doi:10.1108/eb026644. URL https://doi.org/10.1108/eb026644.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><surname>Aksnes</surname>, <given-names>D.</given-names></string-name>; <string-name><surname>Langfeldt</surname>, <given-names>L.</given-names></string-name>; and <string-name><surname>Wouters</surname>, <given-names>P.</given-names></string-name> <year>2019</year>. <article-title>Citations, Citation Indicators, and Research Quality: An Overview of Basic Concepts and Theories</article-title>. <source>SAGE Open</source> <volume>9</volume>: <fpage>215824401982957</fpage>. doi:10.1177/2158244019829575.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><surname>Anseel</surname>, <given-names>F.</given-names></string-name>; <string-name><surname>Duyck</surname>, <given-names>W.</given-names></string-name>; <string-name><surname>De Baene</surname>, <given-names>W.</given-names></string-name>; and <string-name><surname>Brysbaert</surname>, <given-names>M.</given-names></string-name> <year>2004</year>. <article-title>Journal Impact Factors and Self-Citations: Implications for Psychology Journals</article-title>. <source>American Psychologist</source> <volume>59</volume>(<issue>1</issue>): <fpage>49</fpage>-<lpage>51</lpage>. doi:10.1037/0003-066X.59.1.49.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Beltagy</surname>, <given-names>I.</given-names></string-name>; <string-name><surname>Lo</surname>, <given-names>K.</given-names></string-name>; and <string-name><surname>Cohan</surname>, <given-names>A.</given-names></string-name> <year>2019</year>. <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>. <fpage>3615</fpage>-<lpage>3620</lpage>. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1371. URL https://www.aclweb.org/anthology/D19-1371.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Bennett</surname>, <given-names>C.</given-names></string-name>; <string-name><surname>Chambers</surname>, <given-names>L. M.</given-names></string-name>; <string-name><surname>Al-Hafez</surname>, <given-names>L.</given-names></string-name>; <string-name><surname>Michener</surname>, <given-names>C. M.</given-names></string-name>; <string-name><surname>Falcone</surname>, <given-names>T.</given-names></string-name>; <string-name><surname>Yao</surname>, <given-names>M.</given-names></string-name>; and <string-name><surname>Berghella</surname>, <given-names>V.</given-names></string-name> <year>2020</year>. <article-title>Retracted articles in the obstetrics literature: lessons from the past to change the future</article-title>. <source>American Journal of Obstetrics &amp; Gynecology MFM</source> <volume>2</volume>(<issue>4</issue>): <fpage>100201</fpage>. ISSN 2589-9333. doi:10.1016/j.ajogmf.2020.100201. URL http://www.sciencedirect.com/science/article/pii/S2589933320301701.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Bordons</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Fernández</surname>, <given-names>M. T.</given-names></string-name>; and <string-name><surname>Gómez</surname>, <given-names>I.</given-names></string-name> <year>2002</year>. <article-title>Advantages and limitations in the use of impact factor measures for the assessment of research performance</article-title>. <source>Scientometrics</source> <volume>53</volume>(<issue>2</issue>): <fpage>195</fpage>-<lpage>206</lpage>. ISSN 1588-2861. doi:10.1023/A:1014800407876. URL https://doi.org/10.1023/A:1014800407876.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><surname>Bornmann</surname>, <given-names>L.</given-names></string-name>; and <string-name><surname>Mutz</surname>, <given-names>R.</given-names></string-name> <year>2015</year>. <article-title>Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references</article-title>. <source>Journal of the Association for Information Science and Technology</source> <volume>66</volume>(<issue>11</issue>): <fpage>2215</fpage>-<lpage>2222</lpage>. doi:10.1002/asi.23329. URL https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23329.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Brainard</surname>, <given-names>J.</given-names></string-name> <year>2018</year>. <article-title>Rethinking retractions</article-title>. <source>Science</source> <volume>362</volume>(<issue>6413</issue>): <fpage>390</fpage>-<lpage>393</lpage>. ISSN 0036-8075. doi:10.1126/science.362.6413.390. URL https://science.sciencemag.org/content/362/6413/390.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Brainard</surname>, <given-names>J.</given-names></string-name>; and <string-name><surname>You</surname>, <given-names>J.</given-names></string-name> <year>2018</year>. <article-title>What a massive database of retracted papers reveals about science publishing's 'death penalty'</article-title>. <source>Science</source> <volume>25</volume>(<issue>1</issue>): <fpage>1</fpage>-<lpage>5</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name> <year>2001</year>. <article-title>Random forests</article-title>. <source>Machine Learning</source> <volume>45</volume>(<issue>1</issue>): <fpage>5</fpage>-<lpage>32</lpage>. ISSN 0885-6125. doi:10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Camerer</surname>, <given-names>C. F.</given-names></string-name>; <string-name><surname>Dreber</surname>, <given-names>A.</given-names></string-name>; <string-name><surname>Holzmeister</surname>, <given-names>F.</given-names></string-name>; <string-name><surname>Ho</surname>, <given-names>T.-H.</given-names></string-name>; <string-name><surname>Huber</surname>, <given-names>J.</given-names></string-name>; <string-name><surname>Johannesson</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Kirchler</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Nave</surname>, <given-names>G.</given-names></string-name>; <string-name><surname>Nosek</surname>, <given-names>B. A.</given-names></string-name>; <string-name><surname>Pfeiffer</surname>, <given-names>T.</given-names></string-name>; et al. <year>2018</year>. <article-title>Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015</article-title>. <source>Nature Human Behaviour</source> <volume>2</volume>(<issue>9</issue>): <fpage>637</fpage>-<lpage>644</lpage>. doi:10.1038/s41562-018-0399-z.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><surname>Casadevall</surname>, <given-names>A.</given-names></string-name>; <string-name><surname>Steen</surname>, <given-names>R. G.</given-names></string-name>; and <string-name><surname>Fang</surname>, <given-names>F. C.</given-names></string-name> <year>2014</year>. <article-title>Sources of error in the retracted scientific literature</article-title>. <source>The FASEB Journal</source> <volume>28</volume>(<issue>9</issue>): <fpage>3847</fpage>-<lpage>3855</lpage>. doi:10.1096/fj.14-256735.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name><surname>Chan</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Jones</surname>, <given-names>C.</given-names></string-name>; and <string-name><surname>Albarracín</surname>, <given-names>D.</given-names></string-name> <year>2017</year>. <article-title>Countering false beliefs: An analysis of the evidence and recommendations of best practices for the retraction and correction of scientific misinformation</article-title>, <fpage>341</fpage>-<lpage>350</lpage>. Oxford University Press. doi:10.1093/oxfordhb/9780190497620.013.37.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name><surname>Chen</surname>, <given-names>Q.</given-names></string-name>; <string-name><surname>Peng</surname>, <given-names>Y.</given-names></string-name>; and <string-name><surname>Lu</surname>, <given-names>Z.</given-names></string-name> <year>2019</year>. <article-title>BioSentVec: creating sentence embeddings for biomedical texts</article-title>. <source>2019 IEEE International Conference on Healthcare Informatics (ICHI)</source>. doi:10.1109/ICHI.2019.8904728. URL http://dx.doi.org/10.1109/ICHI.2019.8904728.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          Open Science Collaboration; et al. <year>2015</year>. <article-title>Estimating the reproducibility of psychological science</article-title>. <source>Science</source> <volume>349</volume>(<issue>6251</issue>). ISSN 0036-8075. doi:10.1126/science.aac4716. URL https://science.sciencemag.org/content/349/6251/aac4716.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name><surname>Coudert</surname>, <given-names>F.-X.</given-names></string-name> <year>2019</year>. <article-title>Correcting the Scientific Record: Retraction Practices in Chemistry and Materials Science</article-title>. <source>Chemistry of Materials</source> <volume>31</volume>: <fpage>3593</fpage>-<lpage>3598</lpage>. doi:10.1021/acs.chemmater.9b00897.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name><surname>Dal-Ré</surname>, <given-names>R.</given-names></string-name>; <string-name><surname>Bouter</surname>, <given-names>L. M.</given-names></string-name>; <string-name><surname>Moher</surname>, <given-names>D.</given-names></string-name>; and <string-name><surname>Marušić</surname>, <given-names>A.</given-names></string-name> <year>2020</year>. <article-title>Mandatory disclosure of financial interests of journals and editors</article-title>. <source>BMJ</source> <volume>370</volume>. doi:10.1136/bmj.m2872. URL https://www.bmj.com/content/370/bmj.m2872.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name><surname>Dal-Ré</surname>, <given-names>R.</given-names></string-name> <year>2019</year>. <article-title>Analysis of retracted articles on medicines administered to humans</article-title>. <source>British Journal of Clinical Pharmacology</source> <volume>85</volume>(<issue>9</issue>): <fpage>2179</fpage>-<lpage>2181</lpage>. doi:10.1111/bcp.14021. URL https://bpspubs.onlinelibrary.wiley.com/doi/abs/10.1111/bcp.14021.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name><surname>Dinis-Oliveira</surname>, <given-names>R. J.</given-names></string-name> <year>2020</year>. <article-title>COVID-19 research: pandemic versus “paperdemic”, integrity, values and risks of the “speed science”</article-title>. <source>Forensic Sciences Research</source> <volume>5</volume>(<issue>2</issue>): <fpage>174</fpage>-<lpage>187</lpage>. doi:10.1080/20961790.2020.1767754. URL https://doi.org/10.1080/20961790.2020.1767754.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name><surname>Fanelli</surname>, <given-names>D.</given-names></string-name> <year>2013</year>. <article-title>Why Growing Retractions Are (Mostly) a Good Sign</article-title>. <source>PLoS Medicine</source> <volume>10</volume>(<issue>12</issue>): e1001563. ISSN 1549-1676. doi:10.1371/journal.pmed.1001563. URL https://dx.plos.org/10.1371/journal.pmed.1001563.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name><surname>Garfield</surname>, <given-names>E.</given-names></string-name>; et al. <year>1994</year>. <article-title>The impact factor</article-title>. <source>Current Contents</source> <volume>25</volume>(<issue>20</issue>): <fpage>3</fpage>-<lpage>7</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name><surname>Halko</surname>, <given-names>N.</given-names></string-name>; <string-name><surname>Martinsson</surname>, <given-names>P. G.</given-names></string-name>; and <string-name><surname>Tropp</surname>, <given-names>J. A.</given-names></string-name> <year>2011</year>. <article-title>Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions</article-title>. <source>SIAM Rev.</source> <volume>53</volume>(<issue>2</issue>): <fpage>217</fpage>-<lpage>288</lpage>. ISSN 0036-1445. doi:10.1137/090771806. URL https://doi.org/10.1137/090771806.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name><surname>Hesselmann</surname>, <given-names>F.</given-names></string-name>; <string-name><surname>Graf</surname>, <given-names>V.</given-names></string-name>; <string-name><surname>Schmidt</surname>, <given-names>M.</given-names></string-name>; and <string-name><surname>Reinhart</surname>, <given-names>M.</given-names></string-name> <year>2017</year>. <article-title>The visibility of scientific misconduct: A review of the literature on retracted journal articles</article-title>. <source>Current Sociology</source> <volume>65</volume>(<issue>6</issue>): <fpage>814</fpage>-<lpage>845</lpage>. doi:10.1177/0011392116663807.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name><surname>Horton</surname>, <given-names>J.</given-names></string-name>; <string-name><surname>Krishna Kumar</surname>, <given-names>D.</given-names></string-name>; and <string-name><surname>Wood</surname>, <given-names>A.</given-names></string-name> <year>2020</year>. <article-title>Detecting academic fraud using Benford law: The case of Professor James Hunton</article-title>. <source>Research Policy</source> <volume>49</volume>(<issue>8</issue>): <fpage>104084</fpage>. ISSN 0048-7333. doi:10.1016/j.respol.2020.104084. URL http://www.sciencedirect.com/science/article/pii/S0048733320301621.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name><surname>King</surname>, <given-names>E. G.</given-names></string-name>; <string-name><surname>Oransky</surname>, <given-names>I.</given-names></string-name>; <string-name><surname>Sachs</surname>, <given-names>T. E.</given-names></string-name>; <string-name><surname>Farber</surname>, <given-names>A.</given-names></string-name>; <string-name><surname>Flynn</surname>, <given-names>D. B.</given-names></string-name>; <string-name><surname>Abritis</surname>, <given-names>A.</given-names></string-name>; <string-name><surname>Kalish</surname>, <given-names>J. A.</given-names></string-name>; and <string-name><surname>Siracuse</surname>, <given-names>J. J.</given-names></string-name> <year>2018</year>. <article-title>Analysis of retracted articles in the surgical literature</article-title>. <source>The American Journal of Surgery</source> <volume>216</volume>(<issue>5</issue>): <fpage>851</fpage>-<lpage>855</lpage>. doi:10.1016/j.amjsurg.2017.11.033.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name><surname>Kirkpatrick</surname>, <given-names>K.</given-names></string-name> <year>2016</year>. <article-title>Search Engine's Author Profiles Now Driven By Influence Metrics</article-title>. Communications of the ACM. URL https://cacm.acm.org/news/201387-search-enginesauthor-profiles-now-driven-by-influence-metrics/fulltext.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name><surname>Klein</surname>, <given-names>R. A.</given-names></string-name>; <string-name><surname>Vianello</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Hasselman</surname>, <given-names>F.</given-names></string-name>; <string-name><surname>Adams</surname>, <given-names>B. G.</given-names></string-name>; <string-name><surname>Adams Jr</surname>, <given-names>R. B.</given-names></string-name>; <string-name><surname>Alper</surname>, <given-names>S.</given-names></string-name>; <string-name><surname>Aveyard</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Axt</surname>, <given-names>J. R.</given-names></string-name>; <string-name><surname>Babalola</surname>, <given-names>M. T.</given-names></string-name>; <string-name><surname>Bahník</surname>, <given-names>Š.</given-names></string-name>; et al. <year>2018</year>. <article-title>Many Labs 2: Investigating variation in replicability across samples and settings</article-title>. <source>Advances in Methods and Practices in Psychological Science</source> <volume>1</volume>(<issue>4</issue>): <fpage>443</fpage>-<lpage>490</lpage>. doi:10.1177/2515245918810225.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name><surname>Le</surname>, <given-names>Q. V.</given-names></string-name>; and <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name> <year>2014</year>. <article-title>Distributed Representations of Sentences and Documents</article-title>. In <source>International Conference on Machine Learning</source>. doi:10.5555/3044805.3045025.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name><surname>Lesk</surname>, <given-names>M.</given-names></string-name>; <string-name><surname>Mattern</surname>, <given-names>J. B.</given-names></string-name>; and <string-name><surname>Sandy</surname>, <given-names>H. M.</given-names></string-name> <year>2019</year>. <article-title>Are papers with open data more credible? An analysis of open data availability in retracted PLoS articles</article-title>. In <source>International Conference on Information</source>, <fpage>154</fpage>-<lpage>161</lpage>. Springer. doi:10.1007/978-3-030-15742-5_14.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name><surname>Lopez</surname>, <given-names>P.</given-names></string-name> <year>2009</year>. <article-title>GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications</article-title>. <source>ECDL'09</source>, <fpage>473</fpage>-<lpage>474</lpage>. Berlin, Heidelberg: Springer-Verlag. ISBN 3-642-04345-3, 978-3-642-04345-1. doi:10.1007/978-3-642-04346-8_62. URL http://dl.acm.org/citation.cfm?id=1812799.1812875.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name><surname>Micci-Barreca</surname>, <given-names>D.</given-names></string-name> <year>2001</year>. <article-title>A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems</article-title>. <source>ACM SIGKDD Explorations Newsletter</source> <volume>3</volume>(<issue>1</issue>): <fpage>27</fpage>-<lpage>32</lpage>. doi:10.1145/507533.507538.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tarau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>TextRank: Bringing order into text</article-title>
          . In
          <source>Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>404</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Mistry</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bolland</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Publication rates after the first retraction for biomedical researchers with multiple retracted publications</article-title>
          .
          <source>Accountability in Research</source>
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <fpage>277</fpage>
          -
          <lpage>287</lpage>
          . doi:10.1080/08989621.2019.1612244. URL https://doi.org/10.1080/08989621.2019.1612244. PMID: 31025884.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Mott</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fairhurst</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Torgerson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Assessing the impact of retraction on the citation of randomized controlled trial reports: an interrupted time-series analysis</article-title>
          .
          <source>Journal of Health Services Research &amp; Policy</source>
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>44</fpage>
          -
          <lpage>51</lpage>
          . doi:10.1177/1355819618797965. URL https://doi.org/10.1177/1355819618797965. PMID: 30249142.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Nuijten</surname>
            ,
            <given-names>M. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hartgerink</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>van Assen</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Epskamp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wicherts</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>The prevalence of statistical reporting errors in psychology (1985-2013)</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>48</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1205</fpage>
          -
          <lpage>1226</lpage>
          . doi:10.3758/s13428-015-0664-2.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Oransky</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Retraction watch</article-title>
          . URL http://retractiondatabase.org/RetractionSearch.aspx.
        </mixed-citation>
      </ref>
      <ref id="ref45b">
        <mixed-citation>
          <string-name>
            <surname>Pantziarka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Meheus</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Journal retractions in oncology: a bibliometric study</article-title>
          .
          <source>Future Oncology</source>
          <volume>15</volume>
          (
          <issue>31</issue>
          ):
          <fpage>3597</fpage>
          -
          <lpage>3608</lpage>
          . doi:10.2217/fon-2019-0233. URL https://doi.org/10.2217/fon-2019-0233. PMID: 31659916.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Perkel</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A toolkit for data transparency takes shape</article-title>
          .
          <source>Nature</source>
          <volume>560</volume>
          (
          <issue>7718</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>516</lpage>
          . doi:10.1038/d41586-018-05990-5.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          In
          <source>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          ,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . Valletta, Malta: ELRA. URL http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>"Why Should I Trust You?": Explaining the Predictions of Any Classifier</article-title>
          .
          In
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , San Francisco, CA, USA, August 13-17, 2016,
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          . doi:10.18653/v1/N16-3020.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>Seglen</surname>
            ,
            <given-names>P. O.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Why the impact factor of journals should not be used for evaluating research</article-title>
          .
          <source>BMJ</source>
          <volume>314</volume>
          (
          <issue>7079</issue>
          ):
          <fpage>497</fpage>
          . doi:10.1136/bmj.314.7079.497.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>