<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Study of Convolutional Neural Networks for Clinical Document Classification in Systematic Reviews: SysReview at CLEF eHealth 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grace Eunkyung Lee</string-name>
          <email>leee0020@e.ntu.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Engineering Nanyang Technological University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Identifying eligible documents is one of the most time-consuming steps in writing a systematic review. Retrieving numerous clinical documents and then manually checking each one against detailed criteria requires a tremendous amount of time and a skilled workforce. In this paper, to increase the efficiency of this process, we examine the role of convolutional neural networks in classifying medical documents for systematic reviews. The analysis is carried out in the context of our participation in CLEF 2017 eHealth Task 2. The evaluation shows that the suggested methods perform slightly better on full document screening than on abstract screening.</p>
      </abstract>
      <kwd-group>
        <kwd>document classification</kwd>
        <kwd>systematic review</kwd>
        <kwd>convolutional neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recognizing relevant documents among thousands of candidates is one of the
most time-consuming yet important steps in writing systematic reviews.
Systematic reviews analyze and appraise all pertinent literature that meets a set
of pre-defined eligibility criteria. Before analyzing the selected literature for a
review, systematic review authors need to filter related documents by manually
checking numerous documents for eligibility. Since missing relevant
documents is critical, researchers initially collect thousands of potentially
eligible documents from several databases. The collected documents
are then thoroughly examined for eligibility in two steps: abstract screening and
full document screening.</p>
      <p>
        There have been several studies on automating the laborious screening process.
However, imbalanced data and the varying complexity of eligibility criteria
make automating the process a challenging task. Specifically, among 50 Cochrane
systematic reviews, more than 5,000 documents are initially collected on average,
and only around 20 documents turn out to be eligible for the review,
as indicated in Table 1. Furthermore, systematic reviews cover a broad range
of topics, from education for health professionals to heart disease and blood
circulation. The review topic and its scope lead to manifold eligibility criteria [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Several approaches to these issues have been proposed in recent
years. To address the imbalanced data, negative undersampling and weighting
schemes have been used, and active learning in particular has shown promising
performance in coping with the limited number of positive examples [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, the majority of
existing work on improving the screening process applies feature selection and
conventional machine learning algorithms such as SVM to train classifiers. In addition,
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], systematic reviews from two different domains are evaluated and show
different characteristics, but the number of reviews is limited and the diversity of
review topics can be further expanded.
      </p>
      <p>
        In this paper, we examine the efficacy of convolutional neural networks (CNN)
for medical document classification in systematic reviews. The analysis of the
approach is carried out in the context of CLEF 2017 eHealth Task 2:
Technologically Assisted Reviews in Empirical Medicine [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]. The contribution of this
work is a study of a modern machine learning algorithm, CNN, on the task of
identifying eligible clinical documents, despite the challenges of imbalanced data
distribution. To mitigate the imbalance caused by the small number of positive
cases, we train the model on sentence-level rather than document-level context,
with undersampling. We also concatenate context from the systematic review criteria
with sentences of the collected documents. This provides an anchor that
indicates which systematic review each sentence relates to. We evaluate
various combinations of contexts from reviews and documents with CNN.
      </p>
      <p>The remainder of the paper is laid out as follows. In Section 2 we describe
the data, and Section 3 provides a detailed description of our approach to the
task and the model variants. In Section 5 we evaluate our models and analyze
the results. Finally, we conclude the paper by summarizing the major results in
Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Data description</title>
      <p>
        In this work, we use the CLEF 2017 eHealth Task 2 dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The dataset
consists of 50 diagnostic test accuracy (DTA) Cochrane systematic reviews. Each
review includes a title, Boolean queries, and the PMIDs retrieved by the queries.
In addition, each PMID is labeled with its eligibility outcome after two-stage screening:
abstract screening and full document screening.
      </p>
      <p>Table 1 presents statistics of the medical documents collected for the 50
systematic reviews and the number of documents remaining after title
and abstract screening and after full document screening, respectively. From Table 1, we
can see that the initial collection contains numerous medical research papers. In
contrast, the number of positive documents after abstract screening is a small
fraction of the entire collection. After full document screening, the
final number of documents to be included in the reviews is reduced even more
dramatically relative to the initial collection. Hence, the collection of documents
retrieved via Boolean queries is noisy and contains many documents that are
irrelevant to the reviews.</p>
      <p>Unlike common document categorization or sentiment
analysis tasks, several inherent characteristics of systematic reviews make the current
task unique and challenging. One characteristic is the scarcity of positive
data. The number of final positive documents is not enough to train a model for
a single review, since it is often less than 50. Even with techniques for reducing
the imbalance, the absolute number of positive documents is still not sufficient.
To overcome this data sparsity, we combine all documents in the training dataset
and use sentences as the training unit to build one general classifier for DTA
systematic reviews.</p>
      <p>Training a general classifier raises another challenge. In this task, each
document can be classified as either positive or negative depending on the eligibility
criteria, and the eligibility criteria vary across systematic reviews. For instance, a
medical document may be positive in review A and negative in review B because of
different criteria, even though the document is retrieved in both reviews. As
a result, the same document labeled both positive and negative becomes training input for
one classifier. Thus, a document or sentence by itself cannot serve as a stand-alone
training input.</p>
      <p>
        To resolve this challenge, we provide eligibility criteria with each sentence by
concatenating the title of the review with sentences from clinical documents. Since
review titles contain the essential elements of reviews in a brief format, we
believe that they provide a snippet of the eligibility criteria of
the reviews. The concatenation of eligibility criteria and sentences
is described in detail in Section 4.3.
      </p>
      <p>
        In this section, we explain a simple CNN with one layer of convolution on top
of word vectors. The model is a slight variant of the CNN architecture proposed
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Let <inline-formula><tex-math>x_i \in \mathbb{R}^k</tex-math></inline-formula> be the k-dimensional word embedding vector corresponding
to the i-th word in the sentence. A sentence s is represented as
        <disp-formula><label>(1)</label><tex-math>s = x_1 \oplus x_2 \oplus \cdots \oplus x_n</tex-math></disp-formula>
where ⊕ is the concatenation operator and n is the maximum length of sentences.
Sentences are padded to the maximum length if necessary. Review
context is represented in the same way as in Equation (1). A convolution layer involves a filter <inline-formula><tex-math>w \in \mathbb{R}^{hk}</tex-math></inline-formula>,
which is applied to a window of h words to capture information from surrounding
words. A new feature <inline-formula><tex-math>c_i</tex-math></inline-formula>, capturing the context within a window of h words, is
generated by
        <disp-formula><label>(2)</label><tex-math>c_i = f(w \cdot x_{i:i+h-1} + b), \quad 1 \le i \le n-h+1</tex-math></disp-formula>
where <inline-formula><tex-math>x_{i:i+h-1}</tex-math></inline-formula> denotes the concatenation of the word vectors from position i to position i+h-1,
b is a bias term, and the function f is a hyperbolic tangent. As a result, a
sentence is represented by the sequence of features
      </p>
      <p>
        <disp-formula><label>(3)</label><tex-math>c = [c_1, c_2, \ldots, c_{n-h+1}]</tex-math></disp-formula>
with <inline-formula><tex-math>c \in \mathbb{R}^{n-h+1}</tex-math></inline-formula>. Next, we max-pool the output of the convolutional layer,
<inline-formula><tex-math>\hat{c} = \max\{c\}</tex-math></inline-formula>, which merges the results into the most
representative feature; the pooled features form a long feature vector. We then apply the common regularization
method, dropout, to prevent feature vectors from co-adapting and to force them
to learn useful features independently. For more details on dropout
we refer the reader to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. After regularization we classify the result using a
softmax layer. Finally, the predicted results of sentences are combined for document
classification, which is our ultimate task. The document-level score is derived
as follows
      </p>
      <p>
        <disp-formula><label>(4)</label><tex-math>D = \frac{1}{|D|} \sum_{s \in D} p(s)</tex-math></disp-formula>
where |D| is the total number of sentences in a document D and p(s) is the
prediction probability of sentence s derived from the CNN model.</p>
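      <p>The convolution, pooling, and document-level averaging described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiments: it applies a single filter to a toy sentence, the helper names (conv_feature_map, max_over_time, document_score) are hypothetical, and the random inputs stand in for real word embeddings and trained weights.</p>
      <preformat>
```python
import numpy as np


def conv_feature_map(x, w, b):
    """Feature map of one filter: c_i = tanh(w . x_{i:i+h-1} + b).

    x: (n, k) matrix of word vectors, w: (h, k) filter, b: scalar bias.
    Returns a vector with n - h + 1 entries.
    """
    n = x.shape[0]
    h = w.shape[0]
    return np.array(
        [np.tanh(np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)]
    )


def max_over_time(c):
    """Max pooling: keep the strongest response of the filter."""
    return float(np.max(c))


def document_score(sentence_probs):
    """Document score: mean of the sentence-level probabilities p(s)."""
    return float(np.mean(sentence_probs))


rng = np.random.default_rng(0)
x = rng.normal(size=(7, 4))   # toy sentence: n = 7 words, k = 4 dimensions
w = rng.normal(size=(3, 4))   # one filter with window size h = 3
c = conv_feature_map(x, w, b=0.1)
c_hat = max_over_time(c)
score = document_score([0.9, 0.2, 0.4])
```
      </preformat>
      <p>In a full model, one pooled value is produced per filter, and the pooled values are passed through dropout and a softmax layer to obtain p(s) for each sentence.</p>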
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <p>In this section, we discuss our experimental setup for evaluating the effectiveness of
the proposed approach to modeling a medical document classifier. In particular,
Section 4.1 describes the preprocessing and normalization tailored to the
characteristics of biomedical text, while Section 4.2 presents the hyperparameters for
the CNN model. Section 4.3 discusses the variants of our approach used in our
experiments.</p>
      <p>Undersampling is a common way to deal with imbalanced data. We used
negative undersampling when training classifiers because of the limited number
of eligible documents compared to irrelevant documents.</p>
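      <p>As an illustration of negative undersampling, the following sketch keeps every positive example and a random subset of the negatives. The function name, the ratio parameter, and the toy data are assumptions made for this example; the sampling ratio used in the experiments is not stated here.</p>
      <preformat>
```python
import random


def undersample_negatives(examples, ratio=1.0, seed=0):
    """Keep all positives and a random subset of the negatives.

    examples: list of (text, label) pairs, label 1 = eligible, 0 = irrelevant.
    ratio: negatives kept per positive (hypothetical knob for illustration).
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    keep = min(len(negatives), int(round(ratio * len(positives))))
    rng = random.Random(seed)
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


data = [("relevant sentence", 1)] * 3
data += [("irrelevant sentence %d" % i, 0) for i in range(50)]
balanced = undersample_negatives(data, ratio=2.0)
```
      </preformat>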
      <p>Given the IDs of the PubMed documents collected for the 50 systematic reviews, we used
the title and abstract of each clinical document from PubMed. Even though the goal
of the task is to improve both title and abstract screening and full document
screening, we use only the title and abstract of the documents as input
data. We believe they contain the most important content of a document, much like a
summary.</p>
      <p>Prior to classification, sentences from documents undergo normalization, in which
a script using regular expressions simplifies complex numerical and
mathematical notation into a canonical form. All integers, real numbers, and percentages are
mapped to INT, FLOAT, and PERCENT, respectively. Acronyms appear
in parentheses when they are mentioned for the first time, so the parentheses
are removed and the acronyms are treated as single words. Lastly,
measurements such as dosages (e.g., 100g/d) are normalized to MEASUREMENT.</p>
      <p>After normalization, word tokens are represented by pre-trained word embeddings.
To reflect the characteristics of biomedical text, we leverage
Word2Vec vectors pre-trained on PubMed and PubMed Central dumps<sup>1</sup>. Since the word
embeddings are trained on the entire available biomedical literature, we believe
that they can effectively capture semantics for the biomedical domain. The vector
representations have a dimensionality of 200. Words that are not present in the set
of pre-trained words are initialized with all zeros.</p>
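      <p>The normalization step can be sketched with a few regular expressions. The patterns below are illustrative assumptions, not the original script: they cover only simple cases, and the rule order matters because measurements and percentages must be replaced before bare numbers.</p>
      <preformat>
```python
import re

# More specific patterns run before more general ones.
RULES = [
    # Measurements written as number + unit/unit, e.g. 100g/d.
    (re.compile(r"\b\d+(?:\.\d+)?[a-zA-Z]+/[a-zA-Z]+\b"), "MEASUREMENT"),
    # Percentages, e.g. 92.5% or 40 %.
    (re.compile(r"\b\d+(?:\.\d+)?\s?%"), "PERCENT"),
    # Real numbers, e.g. 0.85.
    (re.compile(r"\b\d+\.\d+\b"), "FLOAT"),
    # Remaining integers, e.g. 120.
    (re.compile(r"\b\d+\b"), "INT"),
    # Acronyms introduced in parentheses: "(CT)" becomes "CT".
    (re.compile(r"\(([A-Z][A-Za-z0-9-]*)\)"), r"\1"),
]


def normalize(sentence):
    """Apply each rule in order and return the normalized sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


example = normalize(
    "Sensitivity was 92.5% in 120 patients given 100g/d "
    "of computed tomography (CT), ratio 0.85."
)
```
      </preformat>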
      <p>We use rectified linear units, and the filter window sizes (h) are set to 3, 4, 5, and 6. The
dropout rate is set to 0.5, the mini-batch size is 50, and L2-norm regularization
is not used, for simplicity. For training, 20 systematic reviews
are employed, and training is conducted through stochastic gradient descent
over shuffled mini-batches. The remaining 30 reviews are allocated for testing, which is
identical to the setup of CLEF eHealth 2017 Task 2. The statistics of the training
data for relevant and irrelevant sentences are presented in Table 2.</p>
      <p><sup>1</sup> http://bio.nlplab.org/</p>
      <p>In this work, we try three variants of data concatenation between eligibility
criteria and the documents to be evaluated. Eligibility criteria comprise various elements of
reviews and are often described in documents called protocols. Rather
than processing long and descriptive protocols, we use the title
of a systematic review as its criteria, since a review title represents the vital elements
of the review in a brief format. By providing a hint of the eligibility criteria, each
sentence is tied to the criteria against which it is evaluated. The variations
of concatenating criteria and sentence information are as follows.</p>
      <list list-type="bullet">
        <list-item><p>Cri-Titlesent: A model in which the title of the systematic review and the title of the medical
document are prepended to each sentence of the abstract of the medical document,
and the result is used as input data.</p></list-item>
        <list-item><p>Cri-Sent: Same as the model above, except that only the title of the systematic review is
prepended to each sentence; the title of the document is omitted.</p></list-item>
        <list-item><p>Cri-Title: A model in which the title of the systematic review and the title of the clinical
document are combined. Sentences from the abstract are not used in this model.
Thus, compared to the previous models, it is built with less input data.</p></list-item>
      </list>
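      <p>The three input variants can be sketched as plain string concatenation. This is a simplification under stated assumptions: the function name build_inputs and the whitespace joining are hypothetical, the example titles are invented, and the actual models may combine the criteria and sentence representations differently.</p>
      <preformat>
```python
def build_inputs(review_title, doc_title, abstract_sentences, variant):
    """Return the list of model inputs for one document under a variant."""
    if variant == "Cri-Titlesent":
        # Review title + document title prepended to each abstract sentence.
        return [f"{review_title} {doc_title} {s}" for s in abstract_sentences]
    if variant == "Cri-Sent":
        # Review title only, prepended to each abstract sentence.
        return [f"{review_title} {s}" for s in abstract_sentences]
    if variant == "Cri-Title":
        # Review title + document title; abstract sentences are not used.
        return [f"{review_title} {doc_title}"]
    raise ValueError(f"unknown variant: {variant}")


review = "Imaging tests for detecting condition X"      # invented example
title = "Accuracy of test Y in adults"                   # invented example
sentences = ["We enrolled INT patients.", "Sensitivity was PERCENT."]

titlesent = build_inputs(review, title, sentences, "Cri-Titlesent")
sent = build_inputs(review, title, sentences, "Cri-Sent")
title_only = build_inputs(review, title, sentences, "Cri-Title")
```
      </preformat>
      <p>Cri-Title yields one input per document, whereas the other two variants yield one input per abstract sentence, which is why Cri-Title sees less training data.</p>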
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>In this section, we present and discuss the results obtained by our models on the
test data of the task.</p>
      <p>
        Table 3 shows the results of the three models on different evaluation
measures. The wss@N measure denotes Work Saved over Sampling at recall level N and measures how
much a model reduces the workload of reviewers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Measuring reduced workload
has been one of the common evaluation approaches for the task of
automating the screening process in systematic reviews. The norm area measure denotes the area under
the cumulative recall curve normalized by the optimal area. For more details on the
evaluation measures used in the task, we refer the reader to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
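      <p>For intuition, work saved over sampling can be computed from a ranked list of relevance labels. The sketch below assumes the commonly used definition WSS@R = (TN + FN) / N - (1 - R), where the cutoff is the first rank at which recall R is reached; the function name is hypothetical, and the task overview should be consulted for the exact definition used in the evaluation.</p>
      <preformat>
```python
import math


def wss_at_recall(ranked_labels, recall=0.95):
    """Work saved over sampling for a ranking (1 = relevant, 0 = irrelevant).

    Documents below the cutoff rank are the ones a reviewer would not have
    to read, i.e. the true negatives plus false negatives.
    """
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("ranking contains no relevant documents")
    needed = max(1, math.ceil(recall * total_relevant))
    found = 0
    for rank, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return (n - rank) / n - (1.0 - recall)


ranking = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
wss_100 = wss_at_recall(ranking, recall=1.0)
wss_50 = wss_at_recall(ranking, recall=0.5)
```
      </preformat>
      <p>In this toy ranking both relevant documents are found within the first three ranks, so at full recall seven of the ten documents need not be screened.</p>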
      <p>Since CNN architectures require a massive amount of training data to achieve
reasonable performance, the suggested models perform poorly. This
indicates that the models need more consistently labeled data, even though the
amount of training data has been increased in these models. Compared to the other two
models, Cri-Titlesent and Cri-Sent, the Cri-Title model displays lower
performance because it uses less data for training.</p>
      <p>According to the wss@100 and wss@95 results presented in Table 3, the proposed
models achieve slightly better results on full document screening than on
abstract screening. The relevance results of abstract screening include not only
relevant cases but also cases that cannot be judged because of a lack of
information in the currently available data. Hence, the results from abstract screening
might be less consistent.</p>
      <p>In addition, further investigation of the limited performance revealed that the
model fails to make correct predictions when there is no abstract text. Some
relevant documents have only titles and no abstract text in PubMed. Such
relevant studies are ranked low, which deteriorates the overall performance.</p>
      <p>We believe that the performance of the models has room for improvement.
Collecting the abstract text of relevant studies from various
biomedical literature databases in addition to PubMed, increasing the training data,
and fine-tuning the CNN architecture should lead to better results. We leave this
as future work.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we have presented simple CNN models for easing the
laborious task of identifying eligible documents for systematic reviews. The suggested
models are designed for any DTA systematic review, even though every
systematic review comes with eligibility criteria of different complexity. The
models take advantage of concatenated context from criteria and clinical
documents. The evaluation results show that, while the performance of the proposed
approaches has room for improvement, they perform better in full
document screening than in abstract screening. This work is a step towards applying
deep neural networks to improve the screening process despite the scarcity of
labeled documents and the data imbalance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          et al.
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          .
          <source>In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          . Springer.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Julian PT</given-names>
            <surname>Higgins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sally</given-names>
            <surname>Green</surname>
          </string-name>
          .
          <article-title>Cochrane handbook for systematic reviews of interventions</article-title>
          . Vol.
          <volume>4</volume>
          . John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          et al.
          <article-title>CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview</article-title>
          . In:
          <source>Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Miwa</surname>
          </string-name>
          et al.
          <article-title>Reducing systematic review workload through certainty-based screening</article-title>
          .
          In:
          <source>Journal of Biomedical Informatics</source>
          <volume>51</volume>
          (
          <year>2014</year>
          ), pp.
          <fpage>242</fpage>
          –
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Alison</given-names>
            <surname>O'Mara-Eves</surname>
          </string-name>
          et al.
          <article-title>Using text mining for study identification in systematic reviews: a systematic review of current approaches</article-title>
          .
          In:
          <source>Systematic Reviews</source>
          <volume>4</volume>
          .
          <issue>1</issue>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>