<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BIOTEX: A system for Biomedical Terminology Extraction, Ranking, and Validation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juan Antonio Lossio-Ventura</string-name>
          <email>juan.lossio@lirmm.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clement Jonquet</string-name>
          <email>jonquet@lirmm.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu Roche</string-name>
          <email>mathieu.roche@cirad.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maguelonne Teisseire</string-name>
          <email>maguelonne.teisseire@teledetection.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Irstea</institution>
          ,
          <addr-line>CIRAD, TETIS - Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Montpellier 2, LIRMM</institution>
          ,
          <addr-line>CNRS - Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Term extraction is an essential task in domain knowledge acquisition. Although hundreds of terminologies and ontologies exist in the biomedical domain, the language evolves faster than our ability to formalize and catalog it. We may be interested in the terms and words explicitly used in our corpus in order to index or mine this corpus or just to enrich currently available terminologies and ontologies. Automatic term recognition and keyword extraction measures are widely used in biomedical text mining applications. We present BIOTEX, a Web application that implements state-of-the-art measures for automatic extraction of biomedical terms from free text in English and French.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        A corpus conveys different kinds of information, expressed by different communities.
The terminology and vocabulary are therefore often highly corpus-specific and not explicitly
defined. For instance, in the medical world, the terms employed by lay users on a forum
necessarily differ from the vocabulary used by doctors in electronic health records. We thus
intend to offer users a way to automatically extract biomedical terms and use them for any
natural language processing, indexing, knowledge extraction, or annotation purpose. Extracted
terms can also be used to enrich biomedical ontologies or terminologies by offering new terms
or synonyms to attach to existing defined classes. Automatic Term Extraction (ATE) methods
are designed to automatically extract relevant terms from a given corpus<sup>1</sup>. Relevant
terms are useful to gain further insight into the conceptual structure of a domain. In the
biomedical domain, there is a substantial difference between existing resources (ontologies)
in English and French. In English, there are about 7 000 000 terms associated with about
6 000 000 concepts, such as those in UMLS or BioPortal [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], whereas in French there are only
about 330 000 terms associated with about 160 000 concepts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. French ontologies
therefore have to be populated, and a tool like BIOTEX will help with this task. Our project
involves two main stages: (i) biomedical term extraction, and (ii) ontology population,
in which ontologies are populated with the extracted terms.
      </p>
      <p>
        <sup>1</sup> We refer to ATE when the extracted terms are not previously defined in existing standard
ontologies or terminologies. We refer to "semantic annotation" when an extracted term can be attached
to or matched with an existing class (URI), such as in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Both approaches are related to Named Entity
Recognition (NER), which automatically extracts names of entities (e.g., disease, person, city).
      </p>
      <p>
        In this paper, we present BIOTEX, an application that performs the first stage. Given
a text corpus, it extracts and ranks biomedical terms according to the selected
state-of-the-art extraction measure. In addition, BIOTEX automatically validates terms that
already exist in the UMLS/MeSH-fr terminologies. We have presented the different measures
and performed comparative assessments in other publications [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. In this paper, we
focus on the presentation of BIOTEX and the use cases it supports.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work and available extraction measures</title>
      <p>
        Term extraction techniques can be divided into four broad categories: (i) Linguistic
approaches attempt to recover terms via linguistic patterns [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. (ii) Statistical
methods focus on external evidence through contextual information. Similar methods, called
Automatic Keyword Extraction (AKE), are geared towards extracting the most relevant
words or phrases in a document. These measures, such as Okapi BM25 and TF-IDF,
can be used to automatically extract biomedical terms, as we proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These
two measures are included in BIOTEX. (iii) Machine learning methods are often designed for
specific entity classes and thus integrate term extraction and term classification. (iv)
Hybrid methods: most approaches combine several methods (typically linguistic and
statistical) for the term extraction task. This is the case of C-value [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a very
popular measure specialized in multi-word and nested term extraction.
      </p>
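      <p>As a rough illustration of how C-value [<xref ref-type="bibr" rid="ref2">2</xref>] handles multi-word and nested terms, the sketch below uses its usual formulation: the frequency of a candidate is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of the candidate's length in words. The function names and the toy frequencies are assumptions made for this example. It is not the BIOTEX implementation.</p>
      <preformat><![CDATA[
```python
import math


def is_nested(shorter, longer):
    """Return True if the token tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(shorter), len(longer)
    return n < m and any(longer[i:i + n] == shorter for i in range(m - n + 1))


def c_value(freq):
    """Score candidate terms with C-value.

    `freq` maps each candidate term (a tuple of tokens) to its corpus frequency.
    Single-word candidates get a score of 0 because log2(1) = 0.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        containers = [f_b for other, f_b in freq.items() if is_nested(term, other)]
        if containers:
            f = f - sum(containers) / len(containers)
        scores[term] = math.log2(len(term)) * f
    return scores


# Hypothetical frequencies, chosen only to make the arithmetic easy to follow.
toy_freq = {
    ("basal", "cell", "carcinoma"): 6,
    ("cell", "carcinoma"): 10,
    ("basal", "cell"): 7,
}
scores = c_value(toy_freq)
ranking = sorted(scores, key=scores.get, reverse=True)
```
]]></preformat>
      <p>In this toy example the three-word candidate ranks first, and the two nested candidates are penalized because part of their frequency is explained by the longer term that contains them.</p>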
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we proposed the new hybrid measures F-TFIDF-C and F-OCapi, which
combine C-value with TF-IDF and Okapi, respectively, to extract terms and obtain better
results than C-value. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we proposed the LIDF-value measure, based on linguistic and
statistical information. We offer all of these measures within BIOTEX. Our measures
were evaluated in terms of precision [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and obtained the best results over the top
k extracted terms (P@k) on several corpora (LabTestOnline, GENIA, PubMed). For
instance, on a GENIA corpus, LIDF-value achieved 82% for P@100, thus improving
the C-value precision by 13%, and 66% for P@2000, with an improvement of 11%.
BIOTEX allows users to assess the performance of the measures on different corpora.
      </p>
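      <p>For readers unfamiliar with P@k, the following minimal sketch computes precision over the top k ranked terms: the fraction of the k best-ranked candidates that appear in a reference list. The function name, the ranked list, and the reference set are hypothetical and only serve to illustrate the metric.</p>
      <preformat><![CDATA[
```python
def precision_at_k(ranked_terms, reference_terms, k):
    """Fraction of the top-k ranked terms that appear in the reference set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in reference_terms)
    return hits / k


# Hypothetical ranking produced by some extraction measure (best first).
ranked = ["t cell", "nf kappa b", "the study", "gene expression", "results"]
# Hypothetical gold-standard terms for the same corpus.
reference = {"t cell", "nf kappa b", "gene expression"}

p_at_2 = precision_at_k(ranked, reference, 2)
p_at_5 = precision_at_k(ranked, reference, 5)
```
]]></preformat>
      <p>Here both of the two best-ranked candidates are reference terms, while only three of the top five are, so precision drops as k grows.</p>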
      <p>A detailed study of related work revealed that most existing systems
implementing statistical methods are designed to extract keywords and, to a lesser extent, to extract
terminology from a text corpus. Indeed, most systems take a single text document as
input, not a set of documents (a corpus), for which the IDF can be computed. Most
systems are available only in English. Table 1 shows a quick comparison with TerMine
(C-value), the most commonly used application, and FlexiTerm, the most recent one.</p>
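      <p>The point about the IDF can be made concrete with a small sketch. It assumes the plain, unsmoothed definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. The exact variant used in BIOTEX is not specified here, and the function name and toy documents are made up for the example.</p>
      <preformat><![CDATA[
```python
import math


def idf(term, documents):
    """Unsmoothed inverse document frequency: log(N / df).

    `documents` is a list of token lists. N is the number of documents and
    df is the number of documents that contain `term`.
    """
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(n_docs / df)


# With a single document, every term it contains has df = N = 1, so IDF is 0.
single_doc = [["insulin", "regulates", "glucose"]]
idf_single = idf("insulin", single_doc)

# With several documents, IDF separates rare terms from ubiquitous ones.
corpus = [
    ["insulin", "regulates", "glucose"],
    ["glucose", "levels", "rise"],
    ["glucose", "tolerance", "test"],
]
idf_rare = idf("insulin", corpus)    # occurs in 1 of 3 documents
idf_common = idf("glucose", corpus)  # occurs in all 3 documents
```
]]></preformat>
      <p>With one input document the IDF carries no information, which is why measures such as TF-IDF only become discriminative when a set of documents is supplied.</p>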
    </sec>
    <sec id="sec-3">
      <title>3 Implementation of BIOTEX</title>
      <p>BIOTEX is an application for biomedical terminology extraction that offers several
baseline measures and new measures to rank candidate terms for a given text corpus.
BIOTEX can be used either (i) as a Web application taking a text file as input, or (ii) as
a Java library. When used as a Web application, it produces a file with a maximum of
1200 ranked candidate terms. When used as a Java library, it produces four files with the ranked
candidate terms found in the corpus: unigram, bigram, 3-gram, and 4+ gram terms, respectively.
BIOTEX supports two main use cases:</p>
      <p>(1) Term extraction and ranking measures: As illustrated by the Web application
interface, Figure 1 (1), BIOTEX users can customize the workflow by changing the
following parameters:</p>
      <list list-type="bullet">
        <list-item>
          <p>Choose the corpus language (i.e., English or French) and the Part-of-Speech
(PoS) tagger to apply. Note that we tested three PoS tagger tools, but currently
only TreeTagger is available within BIOTEX.</p>
        </list-item>
        <list-item>
          <p>Select the number of patterns used to filter the candidate terms (200 by default).
These reference patterns (e.g., noun-noun, noun-prep-noun, etc.) were built
with terms taken from UMLS for English and MeSH-fr for French. They are
ranked by frequency.</p>
        </list-item>
        <list-item>
          <p>Select the type of terms to extract: all terms (i.e., single- and multi-word terms)
or multi-word terms only.</p>
        </list-item>
        <list-item>
          <p>Select the ranking measures to apply.</p>
        </list-item>
      </list>
      <p>(2) Validation of candidate terms: After the extraction process, BIOTEX
automatically validates the extracted terms against UMLS (for English) and MeSH-fr (for French). As
illustrated in Figure 1 (2), validated terms are displayed in green, together with
the knowledge source in which they were found, and the other terms are displayed in red. BIOTEX thus
lets users easily distinguish the classes annotating the original corpus (in green) from
the terms that may also be considered relevant for their data but need to be curated
(in red). The latter may be considered candidates for ontology enrichment.</p>
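      <p>The two use cases can be sketched as a small pipeline: candidates are first kept only if their PoS tag sequence matches one of the reference patterns, and the surviving terms are then split into those found in a validation dictionary (shown in green) and those still to be curated (shown in red). Everything in this sketch is an assumption made for illustration: the function names, the Penn-style tags, the toy candidates, and the case-insensitive exact matching. It does not reproduce the BIOTEX Java library or its actual matching rules.</p>
      <preformat><![CDATA[
```python
def filter_by_patterns(candidates, patterns):
    """Keep candidate terms whose PoS tag sequence is one of the reference patterns.

    `candidates` is a list of (term, tags) pairs, `patterns` a list of tag tuples.
    """
    allowed = set(patterns)
    return [term for term, tags in candidates if tuple(tags) in allowed]


def validate(terms, dictionary):
    """Split terms into (validated, to_curate) by case-insensitive exact lookup."""
    known = {entry.lower() for entry in dictionary}
    validated = [t for t in terms if t.lower() in known]
    to_curate = [t for t in terms if t.lower() not in known]
    return validated, to_curate


# Hypothetical tagged candidates, as a PoS tagger might produce them.
candidates = [
    ("heart failure", ("NN", "NN")),
    ("chronic disease", ("JJ", "NN")),
    ("is treated", ("VBZ", "VBN")),
    ("cancer of lung", ("NN", "IN", "NN")),
]
# Hypothetical reference patterns (e.g., noun-noun, adjective-noun, noun-prep-noun).
patterns = [("NN", "NN"), ("JJ", "NN"), ("NN", "IN", "NN")]
# Hypothetical validation dictionary standing in for UMLS or MeSH-fr entries.
dictionary = {"Heart failure", "Chronic disease"}

kept = filter_by_patterns(candidates, patterns)
green, red = validate(kept, dictionary)
```
]]></preformat>
      <p>In this toy run the verbal sequence is discarded by the pattern filter, two terms are validated, and the remaining one is left as a candidate for curation.</p>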
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions and Future Work</title>
      <p>
        In this article, we present the BIOTEX application for biomedical terminology
extraction. It is available for online testing and evaluation but can also be used in any
program as a Java library (PoS tagger not included). In contrast to other existing
systems, BIOTEX makes it possible to analyze a French corpus, manually validate extracted
terms, and export the list of extracted terms. We hope that BIOTEX will be a valuable
tool for the biomedical community. It is currently used in a couple of test-beds within
the SIFR project (http://www.lirmm.fr/sifr). The application is available at
http://tubo.lirmm.fr/biotex/ along with a video demonstration at
http://www.youtube.com/watch?v=EBbkZj7HcL8. For our future validations, we
will enrich our validation dictionaries with BioPortal [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] terms for English and CISMeF
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] terms for French. In the future, we will offer disambiguation features that use the Web
to find context, in order to populate biomedical ontologies with the newly extracted
terms (red terms), while looking into the possibility of extracting relations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] between
new terms and already known terms.
      </p>
      <p>Acknowledgments. This work was supported in part by the French National Research Agency
under the JCJC program, grant ANR-12-JS02-01001, as well as by the University of Montpellier 2,
CNRS, the IBC of Montpellier project, and the FINCyT program, Peru.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic extraction of semantic relations between medical entities: a rule based approach</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          , vol.
          <volume>2</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Frantzi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mima</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Automatic recognition of multi-word terms: the C-value/NC-value method</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          , vol.
          <volume>3</volume>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>130</lpage>
          , (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demetriou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Humphreys</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Term recognition and classification in biological science journal articles</article-title>
          .
          <source>Proceedings of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lossio-Ventura</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roche</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teisseire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards a Mixed Approach to Extract Biomedical Terms from Text Corpus</article-title>
          .
          <source>International Journal of Knowledge Discovery in Bioinformatics, IGI Global</source>
          . vol.
          <volume>4</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          , Hershey, PA, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lossio-Ventura</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roche</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teisseire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Yet another ranking function to automatic multi-word term extraction</article-title>
          .
          <source>Proceedings of the 9th International Conference on Natural Language Processing (PolTAL'14)</source>
          , Springer LNAI. Warsaw, Poland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grosjean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darmoni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Language Resources for French in the Biomedical Domain</article-title>
          .
          <source>9th International Conference on Language Resources and Evaluation (LREC'14)</source>
          . Reykjavik, Iceland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whetzel</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>BioPortal: ontologies and integrated data resources at the click of a mouse</article-title>
          .
          <source>Nucleic Acids Research</source>
          , vol.
          <volume>37</volume>
          (
          <issue>suppl 2</issue>
          ), pp.
          <fpage>170</fpage>
          -
          <lpage>173</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Youn</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callendar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storey</surname>
            ,
            <given-names>M.-A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>NCBO Annotator: Semantic Annotation of Biomedical Data</article-title>
          .
          <source>8th International Semantic Web Conference, Poster and Demonstration Session</source>
          , Washington DC, USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Darmoni</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merabti</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prieur</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joubert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Multiple Terminologies in a Health Portal: Automatic Indexing and Information Retrieval</article-title>
          .
          <source>12th Conference on Artificial Intelligence in Medicine, LNCS 5651</source>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>259</lpage>
          , Verona, Italy (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>