<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Impact of Different Training Sets on Medical Documents Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Gatta</string-name>
          <email>roberto.gatta.bs@alice.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Vallati</string-name>
          <email>m.vallati@hud.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Berardino De Bari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahmut Ozsahin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Radiation Oncology, Centre Hospitalier Universitaire Vaudois</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Radiation Oncology, Università Cattolica S. Cuore</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing and Engineering, University of Huddersfield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Clinical documents stored in a textual and unstructured manner represent a precious source of information that can be gathered by exploiting Information Retrieval techniques. Classification algorithms can be used to organize this huge amount of data, but they are usually tested on standardized corpora, which differ significantly from the actual clinical documents found in a modern hospital. As a result, the observed performance differs from the expected one. Given such differences, it is unclear what the “right” training set should look like, and how its characteristics affect classification performance. In this paper we present the results of an experimental analysis, conducted on actual clinical documents from a medical department, which aims to evaluate the impact of differently sized and assembled training sets on well-known classification techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In modern hospitals a large number of clinical documents are stored in a textual and
unstructured manner; these documents are precious sources of knowledge that must
be exploited rather than merely archived. In order to exploit such knowledge, it is
fundamental to classify the documents. Information Retrieval (IR) techniques provide
an established way to distinguish documents according to their general meaning
(see, for instance, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>
        The traditional approach to IR relies on a large number of already classified
documents (the ground truth) for training classification algorithms. Generally,
the larger the training set, the better the expected performance. IR approaches are
usually evaluated on standard corpora, which are significantly different from the
documents found in real-world environments. In such environments, and especially in
medical ones, several factors can affect the performance of IR classifiers and limit the
usefulness of extremely large training sets. Probably the most critical one is the so-called
document obsolescence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It refers to the fact that, in a clinical context, the turnover
of human resources and the introduction of new techniques and methodologies can
quickly change the text style of medical reports; documents of the training set that
include obsolete terms or structure can act as noise for the classification process.
Therefore, the usual approach based on exploiting large training sets may not be the
best technique.
      </p>
      <p>In this paper we perform an experimental analysis, on about 3,000 medical
documents from a Radiotherapy Department, which aims to evaluate how classification
performance is affected by (i) differently sized training sets, and (ii) the similarity of
the training documents to the document to be classified.</p>
    </sec>
    <sec id="sec-2">
      <title>Considered IR Algorithms</title>
      <p>
        For the sake of this investigation, we considered three existing classifiers: Rocchio [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
ESA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Naive Bayes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Rocchio and Naive Bayes are well known in the literature,
and thus represent the state of the art, while ESA is a more recent and somewhat different
classification algorithm.
      </p>
      <p>
        The Rocchio classifier uses a Vector Space Model (VSM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to generate a multi-dimensional
space where a document is represented as a vector whose components are functions of
the frequencies of the terms. For each class of documents, a centroid is generated. New
documents are classified as members of the class whose centroid is closest. Rocchio
suffers from low accuracy when it has to classify documents that lie close to the boundary
between the regions of two centroids. Our implementation adopted the tf-idf [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] technique to weight the terms in
documents and used the Euclidean distance to measure distances from centroids.
      </p>
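      <p>As an illustration of the centroid-based scheme just described, the following minimal Python sketch builds tf-idf vectors, averages them per class, and assigns a new document to the class with the closest centroid under the Euclidean distance. It is not the implementation used in the experiments: the function names, the particular tf-idf variant (relative term frequency times the logarithm of the inverse document frequency) and the tokenised toy documents are assumptions introduced only for this example.</p>
      <preformatted preformat-type="code"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def fit_idf(token_lists):
    """Inverse document frequency of every term seen in the training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse tf-idf vector (term -> weight); terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v)))


def train_centroids(vectors, labels):
    """Average the tf-idf vectors of each class to obtain one centroid per class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(labels)
    for vector, label in zip(vectors, labels):
        for term, weight in vector.items():
            sums[label][term] += weight
    return {
        label: {term: weight / sizes[label] for term, weight in terms.items()}
        for label, terms in sums.items()
    }


def classify(vector, centroids):
    """Assign the class whose centroid is closest to the document vector."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))


# Hypothetical, already tokenised toy documents (not taken from the corpus).
train_tokens = [
    ["fume", "dix", "cigarettes"],
    ["tabac", "cigarettes", "sevrage"],
    ["allergie", "penicilline"],
    ["allergie", "pollen", "connue"],
]
train_labels = ["tobacco", "tobacco", "allergies", "allergies"]

idf = fit_idf(train_tokens)
centroids = train_centroids([tfidf(t, idf) for t in train_tokens], train_labels)
predicted = classify(tfidf(["cigarettes", "fume"], idf), centroids)
print(predicted)
```
]]></preformatted>
      <p>Because each class is summarised by a single averaged vector, every training document of a class shifts its centroid, including documents written in an obsolete style.</p>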
      <p>ESA is based on the idea of entropy, and exploits a two-step training process. In
the first step it selects a set of terms that best help to predict the probability p(cj | ti)
that a document is classified as cj given that it contains the term ti. In the
second step, ESA calculates the entropy value associated with each term and discards
the terms whose entropy is over a given threshold. To classify a new document, the
score score(cj) of each class is determined using Equation 1. The class with the highest
score is selected.</p>
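      <p><disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{score}(c_j) = \prod_{i=1}^{n} \left[ 1 - p(c_j \mid t_i) \right]</tex-math></disp-formula></p>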
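      <p>The entropy-based filtering of the second training step can be sketched as follows. This is only an illustration under stated assumptions, not the ESA implementation: we assume that p(cj | ti) is estimated as the fraction of training documents containing ti that belong to cj, and that the entropy of a term is the Shannon entropy of that class distribution. The function names, the threshold value and the toy documents are hypothetical.</p>
      <preformatted preformat-type="code"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def class_given_term(token_lists, labels):
    """Estimate p(c | t) as the share of documents containing t that belong to c."""
    counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        for term in set(tokens):
            counts[term][label] += 1
    return {
        term: {label: n / sum(by_class.values()) for label, n in by_class.items()}
        for term, by_class in counts.items()
    }


def term_entropy(distribution):
    """Shannon entropy, in bits, of the class distribution of one term."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def select_terms(probabilities, threshold):
    """Keep only the terms whose entropy does not exceed the threshold."""
    return {
        term
        for term, distribution in probabilities.items()
        if threshold >= term_entropy(distribution)
    }


# Hypothetical toy documents: "patient" occurs in every class, so it is uninformative.
train_tokens = [
    ["patient", "fume", "tabac"],
    ["patient", "allergie", "pollen"],
    ["patient", "tabac"],
    ["patient", "allergie"],
]
train_labels = ["tobacco", "allergies", "tobacco", "allergies"]

probabilities = class_given_term(train_tokens, train_labels)
kept_terms = select_terms(probabilities, threshold=0.5)  # threshold chosen arbitrarily
print(sorted(kept_terms))
```
]]></preformatted>
      <p>In this toy example the term "patient" is spread evenly over both classes, so its entropy is one bit and it is discarded, whereas terms that occur in a single class have zero entropy and are kept.</p>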
      <p>A Naive Bayes (NB) classifier uses a Bayesian approach to calculate the probability
that a document is a member of each possible class. Although it is based on the strong
hypothesis of conditional independence between features, NB usually shows good
performance; moreover, it allows the uncertainty of a classification to be estimated by
evaluating the probability ratios between all pairs of possible classes.</p>
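      <p>A minimal sketch of this idea is given below. It assumes a multinomial model with add-one (Laplace) smoothing, which is one common NB variant and not necessarily the one used in the experiments; the function names and toy documents are hypothetical. The logarithm of the probability ratio between two classes is obtained as the difference of their unnormalised log-posteriors, and a ratio close to zero signals an uncertain decision.</p>
      <preformatted preformat-type="code"><![CDATA[
```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def train_nb(token_lists, labels):
    """Collect the per-class document counts and term counts needed by the model."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        term_counts[label].update(tokens)
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return class_docs, term_counts, vocabulary


def log_posteriors(tokens, model):
    """Unnormalised log-posterior of every class (add-one smoothing, assumed)."""
    class_docs, term_counts, vocabulary = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # class prior
        for term in tokens:
            if term in vocabulary:  # terms unseen in training are skipped
                likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
                score += math.log(likelihood)
        scores[label] = score
    return scores


def log_ratios(scores):
    """Log probability ratio for every pair of classes (first over second)."""
    return {(a, b): scores[a] - scores[b] for a, b in combinations(sorted(scores), 2)}


# Hypothetical, already tokenised toy documents (not taken from the corpus).
train_tokens = [
    ["fume", "dix", "cigarettes"],
    ["tabac", "cigarettes", "sevrage"],
    ["allergie", "penicilline"],
    ["allergie", "pollen", "connue"],
]
train_labels = ["tobacco", "tobacco", "allergies", "allergies"]

model = train_nb(train_tokens, train_labels)
scores = log_posteriors(["cigarettes", "tabac"], model)
predicted = max(scores, key=scores.get)
ratios = log_ratios(scores)
print(predicted, ratios)
```
]]></preformatted>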
    </sec>
    <sec id="sec-3">
      <title>Experimental Analysis</title>
      <p>Clinical documents were collected from a Radiotherapy Department, which contributed
discharge “forms”. Each form is composed of 21 different documents, which should be
classified according to the aspect of the patient they describe: the usage of tobacco and
alcohol, allergies, medications, treatment plan, etc. In total there are about 3,000
documents, written in French, which were divided into the 21 corresponding classes.</p>
      <p>The documents are generally short (94 words on average) and their
structure can differ significantly, since no guidelines or “standard sentences” are
proposed to physicians by the input system. We observed that different physicians wrote
documents in very different ways, from both a structural and a syntactic point of view.</p>
      <p>
        Out of the available 3,000 documents, 2,700 were used for training
and the rest for testing the different algorithms. Since the focus of the analysis is
on the differences between large and small training sets, rather than on the testing
accuracy itself, we decided to devote a very large share of the available data to training.
We considered different percentages of the learning set, ranging from 1% to 99%. In
order to evaluate how the quality of training instances affects classification performance, we
decided to select training documents which are “close” to the document to be classified. In other words,
the given document plays the role of “centroid” for a kNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] extraction whose aim is to
build the training set. The rationale is that, in order to limit the detrimental effect of
obsolescence on classification performance, looking at documents that have already been
classified and are similar to the given one should provide useful information. Similar
documents may have been written in the same period or by the same physician.
The distance between documents was quantified as the Euclidean distance between
their tf-idf vectors, computed over all the words. Higher percentages of considered training
documents imply that less similar documents are exploited for training IR algorithms
and, potentially, that more noise is introduced.
      </p>
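      <p>The selection procedure described above can be sketched in Python as follows. The sketch ranks the training documents by the Euclidean distance between their tf-idf vectors and that of the document to classify, and keeps the closest ones up to the requested percentage. The function names, the tf-idf variant, the rounding of the percentage to a number of documents and the toy documents are assumptions made for illustration, not details of the experimental code.</p>
      <preformatted preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def fit_idf(token_lists):
    """Inverse document frequency of every term seen in the training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Sparse tf-idf vector (term -> weight); terms unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v)))


def select_training_set(query_vector, train_vectors, percentage):
    """Indices of the closest `percentage` percent of training documents, nearest first."""
    size = max(1, math.ceil(len(train_vectors) * percentage / 100))
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: euclidean(query_vector, train_vectors[i]),
    )
    return ranked[:size]


# Hypothetical, already tokenised toy documents (not taken from the corpus).
train_tokens = [
    ["fume", "dix", "cigarettes"],
    ["tabac", "cigarettes", "sevrage"],
    ["allergie", "penicilline"],
    ["allergie", "pollen", "connue"],
]

idf = fit_idf(train_tokens)
train_vectors = [tfidf(tokens, idf) for tokens in train_tokens]
query_vector = tfidf(["cigarettes", "fume"], idf)

closest_half = select_training_set(query_vector, train_vectors, percentage=50)
print(closest_half)
```
]]></preformatted>
      <p>The classifier is then trained only on the selected documents, so a separate training set is built for each document to be classified.</p>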
      <p>Figure 1 shows how the considered percentages of training documents, ordered
by tf-idf distance, affect the accuracy of the IR algorithms. Remarkably,
the three algorithms show different behaviours. Rocchio achieves its best accuracy when
using only 10% of the available training set; its performance then decreases
monotonically as the number of training documents grows. The accuracy of Naive Bayes
remains stable for training sets with a size between 1% and 60%
of all the available training documents, after which it decreases quickly.
Finally, ESA is the only considered approach whose accuracy increases proportionally
with the size of the exploited training set. It is worth noting that all the algorithms
show reasonably good performance, considering that the documents can be classified into
21 different classes, even when exploiting a very small number of training documents.
This is probably due to the good quality of the training sets, which derives from using the
tf-idf technique for selecting them.</p>
      <p>Interestingly, the average accuracy of the three algorithms increases monotonically
between 1% and 20%, decreases monotonically from 70% upward, and remains
almost the same in between. While a lower accuracy with very small training sets was
expected, it is surprising that very large sets also lead to reduced accuracy. This
suggests that the “the more the better” approach should be revised and improved with better
techniques for selecting training sets.</p>
      <p>To better understand the actual impact of a good selection of training
instances, we compared the accuracy of Rocchio when exploiting (i) the
aforementioned tf-idf-based technique and (ii) a random selection of documents from the training
set. Figure 2 shows the results of this comparison. Remarkably, the performance gap
is significant: a good selection of training instances leads to an evident performance
improvement. Moreover, when exploiting a random selection of training documents, the
size of the training set does not significantly affect the classification performance.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In modern hospitals a large number of clinical documents, which represent a precious
source of information, are stored in a textual and unstructured manner. In order to
exploit such knowledge, it is fundamental to classify the documents using IR algorithms.
Traditional IR approaches are tested and evaluated on standard corpora, which usually
have characteristics that are very different from those of real-world documents. One
of the aspects, observed in clinical documents but not in standard corpora, that has a
remarkable impact on IR performance is obsolescence, which refers to the fact that the
turnover of human resources and the introduction of new techniques and methodologies
can quickly change the text style of reports. The presence of such sudden changes in
real-world text corpora makes the standard learning approach (the larger the training
set, the better) questionable; documents of the training set that include obsolete terms
or structure, w.r.t. the current document to classify, can act as noise.</p>
      <p>In this work we experimentally evaluated how differently sized and differently
assembled training sets affect the classification performance of three IR approaches on
clinical documents from a Radiotherapy Department. The take-home messages can
be summarised as follows: (i) the size of the training set does not significantly affect the
classification performance; (ii) a good selection of training instances, i.e. selecting
training instances which are similar to the one to classify according to
some metric, can boost the accuracy. While the latter is intuitive, the former result is astonishing. It clearly
indicates that focusing on collecting large numbers of training documents is not always
the best strategy for achieving good performance, at least for the considered algorithms.</p>
      <p>
        We see several avenues for future work. Concerning the documents, we are
interested in performing a larger experimental evaluation on documents from different
departments. It can be expected that, in departments that support physicians through
guidelines or “standard sentences”, selecting training instances will not have a
remarkable impact on IR performance. Moreover, we will extend the set of considered
classification algorithms by including more well-known approaches (e.g., kNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and
ensemble methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It will be useful to understand whether such methods, which combine
different classification algorithms, suffer noticeably from document obsolescence, and how
different training sets affect their performance.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bratsas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutkias</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaimakamis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamidis</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pangalos</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maglaveras</surname>
          </string-name>
          , N.:
          <article-title>Knowbasics-m: An ontology-based system for semantic management of medical problems and computerised algorithmic solutions</article-title>
          .
          <source>Computer methods and programs in biomedicine 88(1)</source>
          ,
          <fpage>39</fpage>
          -
          <lpage>51</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dietterich</surname>
          </string-name>
          , T.G.:
          <article-title>Ensemble methods in machine learning</article-title>
          .
          <source>In: Multiple classifier systems, LNCS 1857</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouckaert</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Naive bayes for text classification with unbalanced classes</article-title>
          .
          <source>In: Proc. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases</source>
          . pp.
          <fpage>503</fpage>
          -
          <lpage>510</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gatta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vallati</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Bari</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasinetti</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappelli</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pirola</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salvetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buglione</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muiesan</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magrini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Information retrieval in medicine: an extensive experimental study</article-title>
          .
          <source>In: the 7th International Conference on Health Informatics (HealthInf)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>An knn model-based approach and its application in text categorization</article-title>
          .
          <source>In: Proc. 5th International Conference on Computational Linguistics and Intelligent Text Processing</source>
          . pp.
          <fpage>559</fpage>
          -
          <lpage>570</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rocchio</surname>
            ,
            <given-names>J.J.:</given-names>
          </string-name>
          <article-title>Relevance feedback in information retrieval</article-title>
          . In:
          <article-title>The Smart retrieval system - experiments in automatic document processing</article-title>
          , pp.
          <fpage>313</fpage>
          -
          <lpage>323</lpage>
          . Englewood Cliffs, NJ: Prentice-Hall (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.:</given-names>
          </string-name>
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>In: Information processing and management</source>
          . pp.
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>