<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-Scale Semantic Indexing of Biomedical Publications at BioASQ</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grigorios Tsoumakas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manos Laliotis</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Markantonatos</string-name>
          <email>nikos@atypon.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Vlahavas</string-name>
          <email>vlahavas@csd.auth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki 54124</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Atypon Hellas</institution>
          ,
          <addr-line>Dimitrakopoulou 7, Agia Paraskevi 15341, Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Atypon</institution>
          ,
          <addr-line>5201 Great America Parkway Suite 510, Santa Clara, CA 95054</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated annotation of scientific publications in real-world digital libraries requires dealing with challenges such as a large number of concepts and training examples, multi-label training examples and a hierarchical structure of concepts. BioASQ is a European project that contributes a large-scale corpus of biomedical publications for working on these challenges. This paper documents the participation of our team in the large-scale biomedical semantic indexing task of BioASQ.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-label learning</kwd>
        <kwd>semantic indexing</kwd>
        <kwd>biomedical literature</kwd>
        <kwd>text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The number of scientific publications digitally available online is constantly
increasing. New conference publications and journal articles are continuously
added to the digital libraries of publishers (e.g. Elsevier's ScienceDirect, Springer's
SpringerLink), scientific societies (e.g. ACM Digital Library, IEEE Xplore), search
engines (e.g. Google Scholar) and open access repositories (e.g. arXiv.org,
CiteSeerX). On top of this scientific knowledge, digital libraries strive to offer useful
services, such as search, exploration, filtering, citation analysis and trend
detection. Content-based services of digital libraries rely largely on publications being
accompanied by semantic meta-data with all relevant concepts from the
ontology of the corresponding domain, such as the Medical Subject Headings (MeSH)
for Medicine and the ACM Computing Classification System for Computing.</p>
      <p>Some libraries employ experts to manually annotate publications at the
document level according to a domain's ontology. PubMed, for example, manually
indexes its collection according to MeSH. However, this entails significant costs
in time and money. An alternative solution is automatic indexing of
publications by computer systems utilizing text categorization technology. Automatic
indexing is important even for libraries that can afford manual annotation, for
two reasons. Firstly, it may take a couple of months from the moment a
publication enters the library to the moment it receives its annotation. For a publication
with novel and important scientific results, this first period of its lifetime is quite
important, yet it is during this period that the publication remains semantically invisible. Secondly,
automatic indexing can serve as assistive technology for human annotators by
ranking the concepts according to their predicted relevance to a document or by filtering
out large parts of the ontology that it predicts to be unrelated with high confidence.</p>
      <p>
        While text mining research has progressed significantly in the last 10 years,
the problem of automatic indexing of scientific publications in real-world digital
libraries presents some unique challenges that remain largely unsolved.
Real-world digital libraries curate ontologies composed of thousands of concepts and
manage collections composed of millions of publications. Efficient yet accurate
learning and inference with such large ontologies and training sets is non-trivial.
The concepts in real-world ontologies are hierarchically structured as a directed
acyclic graph indicating subsumption relations among parent and child concepts.
While some progress has recently been achieved on exploiting such relationships,
it is not entirely clear when and how these relationships help accuracy. Each
scientific document is typically annotated with more than one concept, rendering
semantic indexing of scientific literature a multi-label learning task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which
presents the additional challenge of exploiting label dependencies to improve
accuracy. Finally, as domain ontologies evolve along with the scientific areas
they describe, automatic indexing models must deal with changes in the
ontology, both explicit (i.e. addition, deletion, merging of concepts) and implicit (i.e.
altered semantics of concepts).
      </p>
      <p>BioASQ<fn><p>http://bioasq.org/</p></fn> is a timely European project that offers a perfect playground for
researchers working on these challenges. It made available a training corpus
consisting of approximately 11 million articles from MEDLINE, each one
annotated on average with approximately 13 concepts from MeSH. It also organized
a long-term, real-world, time-constrained benchmark. For 18 weeks, each Monday
at 17:00 CET, it released a batch of recent unannotated MEDLINE
documents whose size ranged from 793 to 10,233 documents, and it
requested a set of concepts for each of these documents within 21 hours. This paper documents
the participation of our team in this benchmark. Section 2 discusses the general
approach that we used to deal with the problem and Section 3 describes the
particular systems we used to submit annotations. Section 4 presents the results
that we achieved and Section 5 mentions the open issues left for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>Starting from the 10,876,004 documents of the released training corpus, we
initially removed duplicate entries based on the PMID of the documents, which is the
unique number assigned to each PubMed record. This led to a reduced training
corpus of 10,699,707 documents.</p>
      <p>These documents belong to 8,916 journals. The challenge organizers decided
to sample test documents from 1,806 of these journals that are characterized by
a short annotation time, in order to avoid delays in the evaluation of systems. We
therefore filtered the training corpus, keeping only documents from these 1,806
journals, in order to make the distribution of the training documents as similar
as possible to that of the test documents, which is a core assumption in supervised machine
learning. This was not straightforward, as the list of the 1,806 journals contained
abbreviated titles, while the training corpus contained full titles. We fortunately
managed to retrieve from the NLM catalog a text file containing both the full and
abbreviated titles of all PubMed journals (J_Medline.txt). This filtering process
led to a reduced corpus of 3,950,721 documents.</p>
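      <p>The two corpus reduction steps described so far can be sketched as follows. This is a minimal illustration and not the code used in our experiments: the record fields <monospace>pmid</monospace> and <monospace>journal</monospace>, the function names, and the dictionary that maps full journal titles to abbreviated ones are all assumptions made for the example.</p>
      <preformat>
```python
def deduplicate_by_pmid(records):
    """Keep the first record seen for each PMID (hypothetical 'pmid' field)."""
    seen = set()
    unique = []
    for rec in records:
        if rec["pmid"] not in seen:
            seen.add(rec["pmid"])
            unique.append(rec)
    return unique


def filter_by_journal(records, abbreviated_titles, full_to_abbrev):
    """Keep records whose full journal title maps to an allowed abbreviation.

    full_to_abbrev is an assumed dict built from a title list such as the
    NLM catalog file; journals missing from it are dropped.
    """
    allowed = set(abbreviated_titles)
    return [
        rec for rec in records
        if full_to_abbrev.get(rec["journal"]) in allowed
    ]


if __name__ == "__main__":
    corpus = [
        {"pmid": 1, "journal": "Nature"},
        {"pmid": 1, "journal": "Nature"},
        {"pmid": 2, "journal": "The Lancet"},
    ]
    titles = {"Nature": "Nature", "The Lancet": "Lancet"}
    unique = deduplicate_by_pmid(corpus)
    print(len(filter_by_journal(unique, ["Lancet"], titles)))
```
      </preformat>
      <p>Matching on the abbreviated title through an explicit mapping avoids silently keeping only those journals whose abbreviation happens to coincide with the full title.</p>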
      <p>The last 12,000 documents of this corpus were withheld as a test set, in order
to simulate a cup of the challenge (six weekly batches), as the initial guidelines for this task mentioned
that each batch would consist of approximately 2,000 documents. Note that
the corpus was sorted chronologically, so these 12,000 documents were the most
recent ones (from the years 2012 and 2013).</p>
      <p>We first extracted the title and the abstract of each document and tokenized
the text using Stanford CoreNLP<fn><p>http://nlp.stanford.edu/software/corenlp.shtml</p></fn>. We then lower-cased the tokens and
constructed a dictionary of unigrams and bigrams with at least 6 occurrences in the
training corpus. Tf-idf values were computed for each token and normalized so that
each document vector has unit length.</p>
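      <p>The representation described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our actual pipeline: documents are assumed to be already tokenized, the idf weight is assumed to be log(N / df), the norm is assumed to be Euclidean, and all function names are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def ngrams(tokens):
    """Return the lower-cased unigrams and bigrams of a token list."""
    lowered = [t.lower() for t in tokens]
    bigrams = [" ".join(pair) for pair in zip(lowered, lowered[1:])]
    return lowered + bigrams


def build_vocabulary(docs, min_count=6):
    """Keep n-grams whose total corpus count reaches min_count."""
    counts = Counter(g for doc in docs for g in ngrams(doc))
    return {g for g, c in counts.items() if c >= min_count}


def tfidf_vectors(docs, vocab):
    """Compute sparse tf-idf vectors scaled to unit Euclidean length."""
    n_docs = len(docs)
    doc_freq = Counter()
    term_freqs = []
    for doc in docs:
        tf = Counter(g for g in ngrams(doc) if g in vocab)
        term_freqs.append(tf)
        doc_freq.update(tf.keys())
    vectors = []
    for tf in term_freqs:
        # Assumed weighting: raw term frequency times log(N / df).
        vec = {g: f * math.log(n_docs / doc_freq[g]) for g, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {g: w / norm for g, w in vec.items()}
        vectors.append(vec)
    return vectors
```
      </preformat>
      <p>Scaling each vector to unit length keeps long abstracts from dominating the linear models merely because they contain more tokens.</p>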
      <p>
        Learning was based on the meta-labeler approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which learns one model
for ranking labels according to their relevance to an instance and another
model for predicting the number of labels related to an instance. Given a test
instance, it selects the top N most relevant labels, where N is the prediction
of the latter model. Our implementation of the meta-labeler approach is based
on linear support vector machines (SVMs) using default parameters (cost = 1,
tolerance = 0.01, bias = 1). For ranking the labels we train a binary
classification model for each of the labels present in our training corpus, while for
predicting the number of labels we train a regression model.
      </p>
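      <p>The prediction step of the meta-labeler can be sketched as follows, assuming that the per-label scores of the binary models and the output of the regression model are already available. Rounding the predicted count to the nearest integer and returning at least one label are assumptions of this sketch, and the function name is hypothetical.</p>
      <preformat>
```python
def select_labels(scores, predicted_count):
    """Return the top-N labels by score, where N is the rounded count.

    scores maps each label to the score of its binary model;
    predicted_count is the real-valued output of the regression model.
    """
    # Assumption: round to the nearest integer and keep at least one label.
    n = max(1, int(round(predicted_count)))
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    example_scores = {"Humans": 1.8, "Mice": -0.4, "Neoplasms": 0.9}
    print(select_labels(example_scores, 2.3))
```
      </preformat>
      <p>Because the number of labels comes from a separate model, a label with a negative score can still be selected when the regression model predicts a large N.</p>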
      <p>
        Our representation and learning approaches were chosen based on our
experience with a similar learning problem over the past year, the main difference
being the availability of the full text of publications. Within that project, we
have investigated a variety of other approaches, including other thresholding
strategies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], such as SCut [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ], approaches for countering class imbalance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
such as majority class undersampling and asymmetric bagging [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and different
representations, such as plain unigrams/bigrams, adding trigrams and BNS
scaling [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], all of which gave worse results than the approach described here. We
have also attempted, without success, to exploit the hierarchy information.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Particular Systems</title>
      <p>We participated in the challenge using four variations of the main approach that
was presented in Section 2. Systems 1 and 3 use respectively the last 800,000 and
700,000 documents of the reduced corpus prior to the 12,000 documents withheld
for testing. We did not use all 3,950,721 documents in order to reduce the time
and space complexity of the approach. System 2 is a simple ensemble, which
considers the output of the binary SVMs of System 3 for some of the labels and
the output of the meta-labeler of System 3 for the rest. This was motivated by
the observation that the meta-labeler performs worse than the binary SVMs for
some labels, especially for highly frequent ones. The choice of model per label
was tuned greedily based on the micro F-measure of the ensemble on the
held-out test set. System 4 is an ensemble of three systems similar to Systems 1 and
3, each one based on a different 500,000-document subset of the reduced corpus.
In particular, we considered the last 1,500,000 documents prior to the 12,000
documents withheld for testing and distributed them round-robin to the three
500,000-document models. The outputs of these models were combined by
majority voting. Table 1 presents the number of examples, unigrams, bigrams
and labels for each of the models.</p>
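      <p>The construction of System 4 can be illustrated with the following sketch. It assumes that voting is done independently per label and that a label is kept when more than half of the models predict it; the function names are hypothetical and the code is not the one used in the challenge.</p>
      <preformat>
```python
from collections import Counter


def round_robin_split(items, n_parts=3):
    """Assign item i to part i mod n_parts, preserving the original order."""
    return [items[i::n_parts] for i in range(n_parts)]


def majority_vote(label_sets):
    """Keep the labels predicted by more than half of the models."""
    votes = Counter(label for labels in label_sets for label in set(labels))
    needed = len(label_sets) // 2 + 1
    return sorted(label for label, v in votes.items() if v >= needed)


if __name__ == "__main__":
    parts = round_robin_split(list(range(7)))
    print(parts)
    predictions = [
        {"Humans", "Mice"},
        {"Humans", "Liver"},
        {"Humans", "Mice", "Brain"},
    ]
    print(majority_vote(predictions))
```
      </preformat>
      <p>With a chronologically sorted corpus, round-robin assignment gives each of the three models a sample that spans the same time period, and requiring agreement between models favours precision over recall.</p>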
      <p>Experiments were run on an HP DL580R07 server featuring four 10-core
processors at 2.26 GHz, 1 TB of RAM and six 10k SAS disks with a capacity of 600 GB each,
set up in RAID 5 for a total of 2.4 TB of storage. The server runs the Linux
CentOS operating system. The largest computational challenge was training the
thousands of binary SVM models. By utilizing parallelization at the label level
and exploiting 40 threads, training took between one and two
days. Note that as predictions were required within 16 hours of test data
release, serialization was used to store the trained binary models on disk. Storing
the models of System 1, for example, required 406 GB. Parallelization at the label
level was also used during prediction.</p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>were equal to their title, such as Nature and Gut, as we were unaware of the
fact that abbreviations were being used in the provided list of journals of test
documents. Week 3 of the 1st cup coincided with the Orthodox Easter and we
did not manage to make a submission.</p>
      <p>The correct version of System 3 was introduced in the 4th week of the 1st
cup, while the correct version of System 2 followed one week later. System 3
was consistently better than System 2 in both evaluation measures, with the
exception of the 2nd week of the 2nd cup and the 4th week of the 3rd cup in the
case of the micro F-measure. This shows that the tuning we did for System 2
most probably overfitted the evaluation set and that a more careful process for
selecting the labels to be predicted directly by the corresponding binary SVMs
must be devised. In the 5th week of the 2nd cup, a parsing issue led to erroneous
submissions for Systems 2 and 3. In the last week of the last cup, we accidentally
submitted erroneous models for Systems 1-3.</p>
      <p>System 1 was introduced in the 4th week of the 2nd cup and topped the
performance tables from then on, with the exception of the erroneous submission in
the very last week of the challenge. The micro F-measure of this system,
the best of the challenge, was around 0.57. This is not a
breathtaking performance, yet it surpassed the performance of MTI First Line Index, a
baseline system from the National Library of Medicine. When contemplating
absolute performance in this challenge, one should not forget the difficulties of the
data (no availability of full text, large number of labels, complex relationships
among labels, potentially noisy labelling).</p>
      <p>System 4 was introduced in the 4th week of the 3rd cup. While its
performance in terms of the two evaluation measures was not good, it was the system
that outperformed the rest by far in the precision measures for the 3 weeks that
it participated in the contest. In particular, its average performance in these
3 weeks was: 0.83 (example-based precision), 0.82/0.76 (micro/macro averaged
precision), 0.91 (hierarchical precision) and 0.56 (LCA precision). This excellent
performance shows that a majority voting ensemble can produce highly precise
classifiers for this task that could potentially be used for partial yet accurate
fully automatic indexing of biomedical literature.</p>
    </sec>
    <sec id="sec-5">
      <title>Open Issues</title>
      <p>
        One issue that we plan to explore in the near future concerns the temporal
dimension of the data. We have already found that the frequency of the labels varies
over time. We want to explore whether handling concept drift [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] can bring
improvements in predictive accuracy. Other issues of interest are whether journal
information can be exploited for improving predictive accuracy and whether the text
of the abstract and the title of a publication should be treated separately.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katakis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mining multi-label data</article-title>
          . In
          <string-name>
            <surname>Maimon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , eds.:
          <source>Data Mining and Knowledge Discovery Handbook. 2nd edn</source>
          . Springer (
          <year>2010</year>
          )
          <fpage>667</fpage>
          –
          <lpage>685</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>V.K.</given-names>
          </string-name>
          :
          <article-title>Large scale multi-label classification via metalabeler</article-title>
          .
          <source>In: WWW '09: Proceedings of the 18th international conference on World wide web</source>
          , New York, NY, USA, ACM (
          <year>2009</year>
          )
          <fpage>211</fpage>
          –
          <lpage>220</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ioannou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakkas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Obtaining bipartitions from score vectors for multi-label classification</article-title>
          .
          <source>In: 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2010)</source>
          , Los Alamitos, CA, USA, IEEE Computer Society (
          <year>2010</year>
          )
          <fpage>409</fpage>
          –
          <lpage>416</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A study of thresholding strategies for text categorization</article-title>
          .
          <source>In: SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference</source>
          , New York, NY, USA, ACM (
          <year>2001</year>
          )
          <fpage>137</fpage>
          –
          <lpage>145</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>A study on threshold selection for multi-label classification</article-title>
          .
          <source>Technical report</source>
          , National Taiwan University (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On strategies for imbalanced text classification using SVM: A comparative study</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>48</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
          <fpage>191</fpage>
          –
          <lpage>201</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>28</volume>
          (
          <issue>7</issue>
          ) (
          <year>2006</year>
          )
          <fpage>1088</fpage>
          –
          <lpage>1099</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Forman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>BNS feature scaling: an improved representation over tf-idf for SVM text classification</article-title>
          .
          <source>In: Proceedings of the 17th ACM conference on Information and knowledge management. CIKM '08</source>
          , New York, NY, USA, ACM (
          <year>2008</year>
          )
          <fpage>263</fpage>
          –
          <lpage>270</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kosmopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Evaluation measures for hierarchical classification: a unified view and novel approaches</article-title>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Xioufis</surname>
            ,
            <given-names>E.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spiliopoulou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          :
          <article-title>Dealing with concept drift and class imbalance in multi-label stream classification</article-title>
          . In Walsh, T., ed.:
          <source>IJCAI, IJCAI/AAAI</source>
          (
          <year>2011</year>
          )
          <fpage>1583</fpage>
          –
          <lpage>1588</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>