Accelerating literature screening for systematic literature reviews with Large Language Models – development, application, and first evaluation of a solution

Paul Herbst and Henning Baars
University of Stuttgart, Chair of Information Systems I, Germany

Abstract
A systematic literature review (SLR) is a cornerstone of any academic endeavor. Nevertheless, literature reviews are time-consuming, arduous, and a regular point of contention. The selection of adequate keywords for a database search that casts a net that is neither too wide nor too narrow, and the selection of filtering criteria, cause particular difficulties. The application of Machine Learning (ML) and Natural Language Processing (NLP) to support these tasks has been proposed before. But the emergence of Large Language Models (LLMs) and Generative Pretrained Transformer (GPT) models brings new options for automation that might capture semantic details that elude former approaches. We discuss application options for the different steps of a literature review and propose, implement, and test a solution for screening large amounts of abstracts in a short amount of time. Our initial results suggest a vast automation potential, despite some risks and limitations that have to be further navigated.

Keywords
machine learning, large language model, abstract screening, systematic literature review

1. Introduction
Literature reviews provide the foundation for any research project. In some cases, they are used to contribute the related work or the conceptual foundations for a specific research project; in others, the literature review stands on its own [1–5] – a standalone literature review [1, 3]. The latter approach is particularly suited for broader topics with hundreds or thousands of relevant papers that warrant a separate quantitative analysis. Unsupported, the related tasks can cost several person-years, a large portion of which is the identification of relevant papers alone. The literature gives examples of costs that go as high as 100,000 USD and beyond [6]. Usually, literature reviews are done with keyword searches in literature databases, although in some cases, all publications of a defined outlet are scanned manually [5]. It has been suggested before to apply Natural Language Processing (NLP) methods and/or Machine Learning (ML) techniques to partly automate this step. The literature already demonstrates some promising results. However, "classical" NLP approaches come with built-in limitations, esp. because of the equivocality, vagueness, ambiguity, and context-dependencies of human language [7, 8]. A promising approach to handle these challenges are transformer-based large language models (LLMs). First introduced in 2019, they have recently shown unprecedented results in a slew of NLP tasks [9–11], and it is therefore plausible to apply them to literature reviews as well. A direct application of state-of-the-art LLMs, however (e.g. ChatGPT on top of a GPT-4 foundational model), is currently still delivering sobering results: the models make up authors, years, and publications or present papers that do not fit the actual subject [12]. However, there are alternative ways to tap into the potential of LLMs that deliver better results. In this paper we show a multistep approach that uses text embeddings or contextual embeddings to create an initial classification of the paper abstracts using established machine learning approaches.
Text embeddings are vectors generated by an LLM [13] to capture the semantics of a word, sentence, or paragraph. This initial classification is followed by the use of the natural language understanding capability of an LLM for the final selection via few-shot learning. Therefore, our research question is: How can LLMs be exploited for an automatic screening of abstracts in an SLR?

2. Conceptual Foundations
There is a plethora of literature on how to conduct a systematic literature review (e.g. [1–5]). In the following, we go by the structure suggested by vom Brocke et al. [5], who distinguish between five phases that are applied cyclically:
1. definition of the review scope,
2. conceptualization of the topic,
3. literature search (keyword-based or by screening all papers),
4. literature analysis and review,
5. research agenda.
While we experimented with all five phases with various LLMs, we so far only got reasonable results for phase 3, which will be our focus here. In phase 3, the choice of pertinent keywords is a central problem [3]. The alternative is scanning all publications from all relevant outlets individually, which is often not feasible due to the required time. Here, the possibility to apply NLP methods as an automation option comes into play. Traditionally, NLP starts with the preprocessing of the text, which among others can include steps for the removal of "irrelevant" stop words, syntactical corrections, lemmatization and stemming (i.e. reducing words to their grammar-independent core), part-of-speech tagging that marks the grammatical role of words and phrases, and the use of thesauri or ontologies to deal with homonyms or synonyms [14, 15]. After that, each unit of text is characterized by a selection of words that sets it apart from the rest, usually with the so-called "TF-IDF" metric [16]. The sequence or deeper meaning of the words is discarded; hence this is called a "bag of words" (BoW) approach. More recent NLP approaches use so-called embeddings, which are vectors of numbers that are attributed to a unit of text (a string of characters or words – a "token", a sentence, or a document) and thereby position the text in a "semantic space", i.e. the vector represents and places the text's meaning [13, 17]. Besides calculating vectors based on the above-mentioned TF-IDF approach, there are two other techniques to create word embeddings [17]:
1. Static embeddings can be generated using pre-trained models. While there are large pretrained static word embedding models like Google's Word2Vec, it is also possible to train one's own models for domain-specific texts. The learned vectors can then be used to measure syntactic and semantic word similarity [18].
2. Contextualized embeddings like ELMo, BERT, and GPT-3 come from pre-trained models that compute embeddings for a sentence dynamically, taking the context of a word into account [19].
Among others, the embeddings can be applied in similarity searches, for document retrieval and entity extraction, as well as for classification or clustering applications.
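To make the distinction tangible, the following minimal sketch contrasts a TF-IDF bag-of-words representation with transformer-generated embeddings. It is purely illustrative and independent of the prototype described later; the use of the scikit-learn and sentence-transformers packages and the model name are our assumptions, not part of the solution.

# Illustrative only: TF-IDF bag-of-words vs. embedding-based similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

abstracts = [
    "We automate abstract screening for systematic literature reviews.",
    "A systematic review of methods for screening citations automatically.",
    "A study of plant growth under varying light conditions.",
]

# Bag of words: each abstract becomes a sparse vector of TF-IDF word weights;
# similarity only reflects shared vocabulary.
tfidf = TfidfVectorizer(stop_words="english")
bow_vectors = tfidf.fit_transform(abstracts)
print(cosine_similarity(bow_vectors[0], bow_vectors[1]))

# Embeddings: each abstract becomes a dense vector in a "semantic space";
# similarity reflects closeness in meaning even with little word overlap.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example of a small pretrained model
embeddings = model.encode(abstracts)
print(cosine_similarity([embeddings[0]], [embeddings[1]]))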
Contextualized embeddings can be produced with a "transformer" architecture, a type of artificial neural network that was originally designed for the transformation of sequences into new sequences (seq2seq), e.g. for language translation. A specialty of transformers is that they take a large sequence (a "context window") of text into account at once, calculate the relative importance of all tokens to all other tokens ("attention"/"self-attention") from multiple angles ("multiple attention heads"), and are trained by predicting omitted or subsequent tokens [20]. Recent pre-trained models ("generative pretrained models", GPTs) have several billion to a few trillion weights (parameters) and are trained on enormous text corpora. The GPTs are meant to represent foundational models (e.g. OpenAI GPT-3.5, OpenAI GPT-4, Google Bard, Meta Llama/Llama 2, etc.) that can be "fine-tuned" and applied to various "downstream tasks", like ChatGPT for chat. Pretrained models can be accessed directly or via Application Programming Interfaces (APIs).

3. Related Work
To fathom the state of the art in applying NLP for the automation of literature review tasks in general and literature scanning in particular, we conducted a "traditional" systematic literature review (according to the recommendations of vom Brocke et al. [5]). We used the four databases AIS Library (1,287 hits, 5 relevant), IEEE Xplore (216 hits, 8 relevant), ACM Digital Library (205 hits, 8 relevant), and Web of Science (937 hits, 29 relevant) with the search string "systematic literature review" AND (automated OR automation OR "large language model" OR llm OR "natural language processing" OR nlp). The resulting pre-selection was deduplicated. After removing papers without full-text access and following a more detailed screening, this selection was narrowed down to a total of 21 relevant papers. The following is an overview of the topics studied in those papers. While we are focusing on the abstract screening step of the SLR, there are more steps that have the potential to be supported by machine learning. Torre-Lopez et al. [21] provided a detailed report on these possibilities for the different phases of the SLR, one of them being the support of search string generation [22, 23]. Additionally, multiple studies have been conducted on the applied NLP techniques, trends, and challenges [6, 24–28]. Furthermore, there have been studies of the automation potential in certain domains, e.g. the clinical domain [29, 30]. The possibility to use Machine Learning and NLP algorithms to decrease the effort needed for conducting an SLR has been studied for some time. Initial work was conducted by Cohen et al. [31] using BoW representations of abstracts in combination with machine learning. They also introduced a measurement scale that allows models to be ranked against each other, namely "work saved over sampling" (WSS), which captures the share of screening work saved compared to random sampling while maintaining a given recall, typically evaluated at a recall threshold of 0.95 (WSS@95) [31] (illustrated in the sketch below). WSS@95 is still widely used to benchmark abstract screener models against each other [32, 33], despite criticism that the ratio of relevant papers in the test set influences the maximum score that can be achieved using this measure [34]. Later studies also mainly focused on traditional NLP techniques, esp. based on BoW and TF-IDF approaches [6, 35–37].
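For illustration, the following minimal sketch shows how WSS@95 can be computed from screening counts. It follows our reading of the definition in Cohen et al. [31], WSS@r = (TN + FN)/N − (1 − r), and is not code from the prototype.

# Illustrative sketch of "work saved over sampling" (WSS), per our reading of Cohen et al. [31].
def wss(true_negatives: int, false_negatives: int, total: int, recall: float = 0.95) -> float:
    """Share of screening work saved over random sampling at the given recall level."""
    return (true_negatives + false_negatives) / total - (1.0 - recall)

# Example: 10,000 abstracts; the classifier excludes 7,800 correctly and misses 40 relevant ones.
print(wss(true_negatives=7800, false_negatives=40, total=10_000))  # approx. 0.73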
As mentioned, these BoW and TF-IDF approaches lack the contextual awareness that transformer-based neural networks provide [19]. Transformers deliver more nuanced results because they are able to capture the semantics of a text [38]. The only study using contextual embeddings that our SLR found was conducted by Alchokr et al. [38], who claimed to have achieved relatively high precision. Yet they were not able to meet the recall score of 0.95 for relevant abstracts proposed by Cohen et al. [31]. In summary, the research into the application of transformer-generated embeddings is still in its infancy, with only one relevant source. We build on these ideas by combining a GPT-based vector embedding model with a classification model, namely a Balanced Random Forest, and by augmenting the initial results with a final classification through a direct application of a GPT-based LLM. To allow for accessibility and reproducibility [21, 33], we publish our source code on GitHub (https://github.com/paul-herbst/llm-for-slr).

4. Methodology and Solution Design
For answering our research question, we utilized the design science approach [39–41]. We followed the recommendations of Österle et al. [41], who distinguish between the phases of analysis, drafting, and evaluation, although our evaluation is so far still of a preliminary nature.

4.1. Design phase – requirements and approach
As for the design, we identified four core requirements that a tool for supporting or automating the abstract screening process of an SLR should fulfill (Table 1).

Table 1: Requirements
R1 – Reproducibility: a core aim of an SLR is to produce transparent, reliable, and valid results [5]
R2 – Very low percentage of false negatives: not omitting relevant papers is a core concern of an SLR [31]
R3 – Low percentage of false positives: the aim of the automation is to reduce the manual work as much as possible and therefore to avoid false positives
R4 – Efficiency: the solution should apply computing resources in a frugal manner, esp. avoiding the large-scale application of LLMs if that can be avoided

While standard literature database queries somewhat fulfill R1 and R2, they often generate a high percentage of false positives [42]. With regards to our own literature search for this paper, out of a total of 2,465 hits, only 50 (2%) were at least partially relevant to our SLR topic (21 after further refinement). Our search highlights the shortcomings of a keyword-based database query: the database cannot distinguish our intention to find papers about the automation of SLRs from papers that merely report an SLR on some automation topic, so it returns the latter as well. Similar issues arise whenever the intended subject of the query is not the actual subject of the paper but rather appears in its context, in an example, or in the framing of a different subject, or when a certain degree of abstraction is intended. Other examples include queries for "analytics and artificial intelligence as a research subject rather than a research method", for the "efficiency of method x in general but not only in a specified domain", etc. We deem this a typical problem that results from the inability to address semantic context – an issue that also arises with a traditional NLP BoW approach.
We chose to counter this with an LLM-based approach that incorporates both language context in general and our specific subject, namely the relevance of abstracts. Since a direct query of an LLM is prone to hallucinations (see section 1), we developed a prototype that melds the natural language understanding capabilities of LLMs with established ML models. More concretely, we chose to apply the OpenAI API for the generation of embeddings of the abstracts, which we further processed with a classifier. Note that the results are not vendor or product specific, as similar features can be used in other LLMs, esp. open-source ones like S-BERT and Llama 2. The field is also developing dynamically, with new alternatives introduced on an almost monthly basis. The embeddings of the abstracts are further classified for relevance using an established ML approach that – unlike a direct classification with a current LLM – can be applied with reasonable costs and thereby supports R4. We decided on a balanced random forest (BRF) [43], as it mitigates the imbalance between irrelevant and relevant papers that is inherent to most literature reviews. A BRF works like a Random Forest, apart from its tree-building step, in which it under-samples the majority class and ensures an equal representation of classes. This approach is further refined by feeding the classification results to an LLM for a final evaluation. We found that this helped with spotting smaller semantic differences and eliminating errors from the training data. In our case, we prompted OpenAI's GPT-4 model with the instruction to classify all abstracts that got unclear classification results from the random forest. This prompt is formulated in natural language and enriched with 2–5 examples of abstracts labeled as relevant or irrelevant ("few-shot learning"). With this step we leverage the capability of the LLM to understand natural language and use it to classify a given text based on a defined and specific context.

4.2. Drafting phase – prototype
Figure 1: Screenshot of the command line interface of the prototype
In the following section, we explain our prototype, beginning with a high-level overview (cf. Figure 2), followed by a more detailed look at each of the components. We implemented a Python-based tool with a command line interface as our prototype (cf. Figure 1). In a first step, we extract the abstracts from several literature databases and feed them into a reference management software that can be accessed via an API (Zotero). For the next step, we compiled a training dataset of abstracts with known results that we used for training the balanced random forest classifier (training component). The classifier operates on the embeddings (vectors) of the abstracts, which are generated by the OpenAI embedding endpoint. The model is then employed as a preliminary classification tool for narrowing down an extensive corpus of papers retrieved by a full scan of an initial broad search: it filters the results down to those relevant to the subject at hand. Unclear results then undergo a refinement step that leverages GPT-4 with a few-shot text classification (refinement component).
Figure 2: Classification process overview

4.3. Training component
The prototype needs several positive and negative examples for training. Since the classification conducted later is grounded in this foundational data, careful consideration should be given to the selection of these papers.
They can, for example, come from an initial exploratory search or from a previous SLR. Our results indicate that about ten relevant and ten irrelevant papers are sufficient to achieve a satisfactory result. It should be noted that the irrelevant papers should vary in topic and that the training examples should include edge cases. For each abstract in the training set, the prototype creates an object consisting of the paper's title, its abstract, and its authors. This object is sent to the OpenAI API to create the text embedding, which is persisted together with a corresponding relevancy tag. This action can be performed for batches of abstracts at once. The prototype also provides the possibility to update the training set and to revert the training set to a previous state. After that, a Balanced Random Forest is trained using the training vector embeddings. Figure 3 shows an overview of the data preparation steps.
Figure 3: Overview of training data preparation

4.4. (Trained) classifier component
After training, the BRF can be applied to pre-classify all abstracts stored in the reference management software. This is achieved by creating vector embeddings for each abstract in the same way as for the training data. The BRF is then used to predict the relevancy of the abstracts (it calculates a probability that the abstract is in the "relevant paper" class). In our tests, the following thresholds delivered reliable results: a relevancy score below 0.6 is classified as "irrelevant" and above 0.75 as "relevant". These thresholds are based on our use of the prototype and provide a balanced tradeoff between the avoidance of false negatives and false positives (R2 and R3). They can be changed depending on the specific setting and the required rigor of the SLR. The initial classification already narrows down the number of hits substantially, but it sometimes fails to classify a paper with sufficient certainty, in our case meaning an assigned relevancy value between 0.6 and 0.75.

4.5. LLM refinement component
To further refine the search within this uncertain category, we employed GPT-4. The LLM was fed a detailed explanation of the SLR topic, along with four examples of abstracts marked as relevant and irrelevant. This additional step helped to further narrow down the search. The following is an example of the structure of the data provided to the LLM. As we used GPT-4, which is trained as a conversational LLM, we were required to provide the information in a chat-like structure. The system prompt we used was: "You are a classifier that predicts whether a paper is relevant based on a prompt. Only ever answer with 'relevant' or 'not relevant'." This system prompt is followed by a "simulated", precomposed conversation between the user and the assistant. The user messages are in the format:
PROMPT: Is this a paper about automating or semi-automating the process of a systematic literature review? Think carefully and read thoroughly. If the paper is not about the automation potential of SLRs, answer 'not relevant'.
TITLE: the title of the paper
ABSTRACT: the abstract of the paper
For each of these user messages, the response of the assistant is either 'relevant' or 'not relevant', depending on the abstract provided. This is done for two relevant and two irrelevant papers. Figure 4 depicts the structure of the prompt that is provided to the LLM.
Figure 4: Structure of the few-shot classification prompt
Afterwards, the LLM is given the abstracts with uncertain classification results.
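The following sketch illustrates the described interplay of embedding generation, BRF pre-classification, and GPT-4 few-shot refinement. It is a simplified illustration of the approach, not the published prototype: the package choices (openai, imbalanced-learn), model names, thresholds wired in as constants, and all function and variable names are our assumptions, and the data handling via the reference manager is omitted.

# Simplified sketch of the three-stage screening pipeline (illustrative, not the prototype).
# Assumes the openai (>= 1.0) and imbalanced-learn packages; model names are examples.
from openai import OpenAI
from imblearn.ensemble import BalancedRandomForestClassifier

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    """Create one embedding vector per abstract via the OpenAI embedding endpoint."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in response.data]

# 1. Training: embeddings of labeled example abstracts (1 = relevant, 0 = irrelevant).
train_texts = ["...abstract of a relevant paper...", "...abstract of an irrelevant paper..."]
train_labels = [1, 0]
brf = BalancedRandomForestClassifier(random_state=42)  # fixed seed for reproducibility (R1)
brf.fit(embed(train_texts), train_labels)

# 2. Pre-classification with the thresholds described in section 4.4.
def pre_classify(abstract: str) -> str:
    proba = brf.predict_proba(embed([abstract]))[0]
    p_relevant = proba[list(brf.classes_).index(1)]
    if p_relevant >= 0.75:
        return "relevant"
    if p_relevant < 0.6:
        return "irrelevant"
    return "uncertain"  # handed over to the LLM refinement step

# 3. Refinement: few-shot classification of uncertain abstracts with GPT-4.
def refine(title: str, abstract: str, few_shot_messages: list[dict]) -> str:
    question = (
        "PROMPT: Is this a paper about automating or semi-automating the process of a "
        "systematic literature review? Think carefully and read thoroughly. If the paper is "
        "not about the automation potential of SLRs, answer 'not relevant'.\n"
        f"TITLE: {title}\nABSTRACT: {abstract}"
    )
    messages = [
        {"role": "system", "content": "You are a classifier that predicts whether a paper is "
                                      "relevant based on a prompt. Only ever answer with "
                                      "'relevant' or 'not relevant'."},
        *few_shot_messages,  # precomposed user/assistant example pairs (two per class)
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(model="gpt-4", temperature=0, messages=messages)
    return response.choices[0].message.content.strip()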
The outputs of the LLM are parsed and converted. Should the output be anything other than 'relevant' or 'not relevant', the prototype messages the user and asks for a manual classification. Notably, this did not happen once during our testing. Although it might sound appealing to use this GPT-4-based approach for all papers, it is important to mention that using GPT-4 to classify every abstract in a few-shot manner would be financially prohibitive in many cases (conflicting with R4). Hence the rationale behind merging the two approaches, random forest classification and GPT-4 review, into an efficient and cost-effective classification process. Note that these restrictions might become obsolete with future, more cost-efficient LLMs. After classifying the abstracts, the user has the possibility to manually change the relevancy classification made by the models. After that, the classification can be exported, and the user can decide whether the classification should be added to the training set of the Random Forest classifier.

4.6. Evaluation phase
In our preliminary evaluation, we scrutinized the solution with respect to our four requirements.
R1: Reproducibility. A cornerstone of the SLR is the ability of other researchers to validate the findings (R1). While LLMs themselves might be a black box and are not inherently explainable, our prototype was developed with reproducibility in mind and allows the exact reproduction of the conducted SLR, given the following prerequisites are met:
1. The papers that are analyzed by the prototype must be the same.
2. The papers used to train the random forest must be the same.
3. The object that is transformed into a vector needs to be in the same format.
4. The vector produced for each object must be the same as in the original SLR.
5. The "seed" (initialization of the randomization) for the Random Forest must be the same.
6. The system prompt for the LLM must be the same.
7. The few-shot examples for the LLM must be the same.
8. The hyperparameter "temperature" of the LLM must be set to zero to produce deterministic responses.
This implies that in a publication of the results of an SLR conducted with this solution, the inputs for the points above should ideally be published alongside the results. To facilitate this, our prototype creates a 'receipt' file that documents these parameters and can be used to reproduce the classification (illustrated below). This document could be added to the paper or uploaded to a repository to decrease friction for researchers wanting to reproduce the SLR.
R2–R4: False positives, false negatives, and efficiency. Three researchers applied the prototype in two literature research projects. On average, they obtained false positive rates of around 0.5 with minimal false negative rates. It needs to be noted that we were able to reduce the person hours necessary for the literature screening by a factor between 6 and 10.
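To make the reproducibility prerequisites concrete, the following is a hypothetical sketch of what such a 'receipt' file could contain. The format and all field names are our illustrative assumptions, not the prototype's actual output.

# Hypothetical sketch of a reproducibility "receipt" covering the eight prerequisites above.
import json

receipt = {
    "analyzed_papers": "export_2023-08-15.bib",               # 1. the screened corpus
    "training_papers": "training_set_v3.json",                # 2. BRF training examples
    "embedding_input_format": "title + abstract + authors",   # 3. object sent to the API
    "embedding_model": "text-embedding-ada-002",              # 4. model that produced the vectors
    "random_forest_seed": 42,                                 # 5. BRF initialization
    "llm_system_prompt": "You are a classifier that ...",     # 6. system prompt
    "few_shot_examples": "few_shot_v1.json",                   # 7. precomposed example conversation
    "llm_temperature": 0,                                      # 8. deterministic responses
}

with open("slr_receipt.json", "w", encoding="utf-8") as f:
    json.dump(receipt, f, indent=2)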
5. Discussion
In this paper, we present a novel approach of leveraging transformer-based LLMs to substantially decrease the manual workload of abstract screening during SLRs while avoiding missing relevant literature (false negatives). This is accomplished by chaining two different classification steps: a Random Forest classifier that classifies contextualized embeddings of the abstracts, and GPT-4 in a few-shot classification setup. Despite all benefits, our approach also comes with some drawbacks. Mainly, the researcher must give away control to a black box. This can also make the SLR less transparent for other researchers. Still, it must be considered that the existing approach to SLRs is not without flaws either. Our testing showed that LLMs are not yet capable of automating an SLR in a "zero-shot" fashion, i.e. by just asking for relevant papers on some research topic. However, in our case, a 4-example few-shot learning approach already led to satisfactory results when providing an abstract to classify. As following research steps, we particularly want to address a more rigorous evaluation. We conceive an experiment-based setting in which this LLM-based solution is systematically compared with a manual screening, a keyword-based approach, and a BoW approach. It needs to be highlighted that the LLM field is continually developing, and it is prudent to assume that, within the foreseeable future, LLMs can support or even automate more steps of an SLR, from the support of the conceptualization of the literature review to the proposition of a research agenda.

References
[1] A. Fink, Conducting Research Literature Reviews: From the Internet to Paper, Sage Publications, Los Angeles, 2014.
[2] J. Webster, R. Watson, Analyzing the Past to Prepare for the Future: Writing a Literature Review, MIS Quarterly 26 (2002) 13–23.
[3] Y. Levy, T. J. Ellis, A Systems Approach to Conduct an Effective Literature Review in Support of Information Systems Research, Informing Science Journal 9 (2006) 181–212. https://doi.org/10.28945/479.
[4] C. Okoli, K. Schabram, A Guide to Conducting a Systematic Literature Review of Information Systems Research, SSRN Journal (2010). https://doi.org/10.2139/ssrn.1954824.
[5] J. vom Brocke, A. Simons, B. Niehaves, K. Riemer, R. Plattfaut, A. Cleven, Reconstructing the Giant: On the Importance of Rigor in Documenting the Literature Search Process, in: Proceedings of the 17th European Conference on Information Systems, Verona, 2009.
[6] R. van Dinter, B. Tekinerdogan, C. Catal, Automation of Systematic Literature Reviews: A Systematic Literature Review, Information and Software Technology 136 (2021) 1–16. https://doi.org/10.1016/j.infsof.2021.106589.
[7] S. Jusoh, A study on NLP applications and ambiguity problems, Journal of Theoretical and Applied Information Technology 96 (2018) 1486–1499.
[8] L. Galke, A. Diera, B. X. Lin, B. Khera, T. Meuser, T. Singhal, F. Karl, A. Scherp, Are We Really Making Much Progress in Text Classification? A Comparative Review, 2023. URL: http://arxiv.org/abs/2204.03954.
[9] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, J. Zhou, T. Gui, Q. Zhang, X. Huang, How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks, 2023. https://doi.org/10.48550/arXiv.2303.00293.
[10] A. Koubaa, GPT-4 vs. GPT-3.5: A Concise Showdown, 2023. URL: https://www.techrxiv.org/articles/preprint/GPT-4_vs_GPT-3_5_A_Concise_Showdown/22312330/2. https://doi.org/10.36227/techrxiv.22312330.v2.
[11] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, B. Ge, Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models, 2023. URL: http://arxiv.org/abs/2304.01852. https://doi.org/10.48550/arXiv.2304.01852.
[12] R. Qureshi, D. Shaughnessy, K.A.R. Gill, K.A. Robinson, T. Li, E. Agai, Are ChatGPT and large language models "the answer" to bringing us closer to systematic review automation?, Systematic Reviews 12 (2023). https://doi.org/10.1186/s13643-023-02243-z.
[13] J. Camacho-Collados, M.T. Pilehvar, Embeddings in Natural Language Processing, in: Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, International Committee for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 10–15. https://doi.org/10.18653/v1/2020.coling-tutorials.2.
[14] D. Sullivan, Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, New York, 2001.
[15] N. Milić-Frayling, Text processing and information retrieval, in: A. Zanasi (Ed.), WIT Transactions on State of the Art in Science and Engineering, WIT Press, 2005, pp. 1–45. https://doi.org/10.2495/978-1-85312-995-7/01.
[16] G. Sidorov, Vector space model for texts and the tf-idf measure, in: SpringerBriefs in Computer Science, Springer, 2019, pp. 11–15. https://doi.org/10.1007/978-3-030-14771-6_3.
[17] S. Selva Birunda, R. Kanniga Devi, A Review on Word Embedding Techniques for Text Classification, in: J.S. Raj, A.M. Iliyasu, R. Bestak and Z.A. Baig (Eds.), Innovative Data Communication Technologies and Application, Springer, Singapore, 2021, pp. 267–281. https://doi.org/10.1007/978-981-15-9651-3_23.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, 2013. URL: http://arxiv.org/abs/1301.3781.
[19] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep Contextualized Word Representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. https://doi.org/10.18653/v1/N18-1202.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems, Curran Associates Inc., 2017.
[21] J. de la Torre-Lopez, A. Ramirez, J.R. Romero, Artificial intelligence to automate the systematic review of scientific literature, Computing, 2023. https://doi.org/10.1007/s00607-023-01181-x.
[22] L. Cairo, G.F. de Carneiro, M.P. Monteiro, F.B. e Abreu, Towards the Use of Machine Learning Algorithms to Enhance the Effectiveness of Search Strings in Secondary Studies, in: Proceedings of the XXXIII Brazilian Symposium on Software Engineering, Association for Computing Machinery, New York, NY, USA, 2019, pp. 22–26. https://doi.org/10.1145/3350768.3350772.
[23] A.E. Kwabena, O.-B. Wiafe, B.-D. John, A. Bernard, F.A.F. Boateng, An automated method for developing search strategies for systematic review using Natural Language Processing (NLP), MethodsX 10 (2023). https://doi.org/10.1016/j.mex.2022.101935.
[24] L. Feng, Y.K. Chiam, S.K. Lo, Text-Mining Techniques and Tools for Systematic Literature Reviews: A Systematic Literature Review, in: Proceedings of the 24th Asia-Pacific Software Engineering Conference (APSEC), 2017, pp. 41–50. https://doi.org/10.1109/APSEC.2017.10.
[25] H. Muller, S. Pachnanda, F. Pahl, C. Rosenqvist, The application of artificial intelligence on different types of literature reviews – A comparative study, in: Proceedings of the 2022 International Conference on Applied Artificial Intelligence (ICAPAI), IEEE Norway Section, CIS Chapter, 2022, pp. 38–44. https://doi.org/10.1109/ICAPAI55158.2022.9801564.
[26] Y. Shakeel, J. Krüger, I. von Nostitz-Wallwitz, C. Lausberger, G.C. Durand, G. Saake, T. Leich, (Automated) literature analysis: threats and experiences, in: Proceedings of the International Workshop on Software Engineering for Science, Association for Computing Machinery, New York, NY, USA, 2018, pp. 20–27. https://doi.org/10.1145/3194747.3194748.
[27] Y. Shakeel, J. Krüger, I.V. Nostitz-Wallwitz, G. Saake, T. Leich, Automated Selection and Quality Assessment of Primary Studies: A Systematic Literature Review, J. Data and Information Quality 12 (2019). https://doi.org/10.1145/3356901.
[28] R. Ros, E. Bjarnason, P. Runeson, A Machine Learning Approach for Semi-Automated Search and Selection in Literature Studies, in: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, Association for Computing Machinery, New York, NY, USA, 2017, pp. 118–127. https://doi.org/10.1145/3084226.3084243.
[29] T. Tsubota, D. Bollegala, Y. Zhao, Y. Jin, T. Kozu, Improvement of intervention information detection for automated clinical literature screening during systematic review, Journal of Biomedical Informatics 134 (2022). https://doi.org/10.1016/j.jbi.2022.104185.
[30] M. Michelson, K. Reuter, The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials, Contemporary Clinical Trials Communications 16 (2019). https://doi.org/10.1016/j.conctc.2019.100443.
[31] A.M. Cohen, W.R. Hersh, K. Peterson, P.-Y. Yen, Reducing Workload in Systematic Review Preparation Using Automated Citation Classification, J Am Med Inform Assoc 13 (2006) 206–219. https://doi.org/10.1197/jamia.M1929.
[32] A. O'Mara-Eves, J. Thomas, J. McNaught, M. Miwa, S. Ananiadou, Using text mining for study identification in systematic reviews: a systematic review of current approaches, Syst Rev 4 (2015). https://doi.org/10.1186/2046-4053-4-5.
[33] W. Kusa, A. Hanbury, P. Knoth, Automation of Citation Screening for Systematic Literature Reviews Using Neural Networks: A Replicability Study, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg and V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 584–598.
[34] W. Kusa, A. Lipani, P. Knoth, A. Hanbury, An analysis of work saved over sampling in the evaluation of automated citation screening in systematic literature reviews, Intelligent Systems with Applications 18 (2023). https://doi.org/10.1016/j.iswa.2023.200193.
[35] A. Bravo, L. Bennetts, P. Atanasov, Accelerating the Early Identification of Relevant Studies in Title and Abstract Screening, in: Proceedings of the 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC 2021), 2021, pp. 132–140. https://doi.org/10.1109/ISCSIC54682.2021.00034.
[36] T. Georgieva-Trifonova, Continued Supporting a Systematic Literature Review by Applying Text Mining Methods, in: 2022 21st International Symposium INFOTEH-JAHORINA (INFOTEH), 2022, pp. 1–5. https://doi.org/10.1109/INFOTEH53737.2022.9751318.
[37] G. Rizzo, F. Tomassetti, A. Vetro, L. Ardito, M. Torchiano, M. Morisio, R. Troncy, Semantic enrichment for recommendation of primary studies in a systematic literature review, Digital Scholarship in the Humanities 32 (2017) 195–208. https://doi.org/10.1093/llc/fqv031.
[38] R. Alchokr, M. Borkar, S. Thotadarya, G. Saake, T. Leich, Supporting Systematic Literature Reviews Using Deep-Learning-Based Language Models, in: 2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), 2022, pp. 67–74. https://doi.org/10.1145/3528588.3528658.
[39] K. Peffers, T. Tuunanen, M.A. Rothenberger, S. Chatterjee, A Design Science Research Methodology for Information Systems Research, Journal of Management Information Systems 24 (2007) 45–77. https://doi.org/10.2753/MIS0742-1222240302.
[40] A.R. Hevner, S.T. March, J. Park, S. Ram, Design Science in Information Systems Research, MIS Quarterly 28 (2004) 75–105. https://doi.org/10.2307/25148625.
[41] H. Österle, R. Winter, W. Brenner (Eds.), Gestaltungsorientierte Wirtschaftsinformatik: ein Plädoyer für Rigor und Relevanz, Infowerk, Nürnberg, 2010.
[42] H. Scells, G. Zuccon, B. Koopman, Automatic Boolean Query Refinement for Systematic Review Literature Search, in: The World Wide Web Conference, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1646–1656. https://doi.org/10.1145/3308558.3313544.
[43] L. Kobyliński, A. Przepiórkowski, Definition Extraction with Balanced Random Forests, in: B. Nordström and A. Ranta (Eds.), Advances in Natural Language Processing, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 237–247. https://doi.org/10.1007/978-3-540-85287-2_23.