<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BioCconvert: A Conversion Tool Between BioC and PubAnnotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Donald C. Comeau</string-name>
          <email>comeau@ncbi.nlm.nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rezarta Islamaj Doğan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sun Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chih-Hsuan Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>W. John Wilbur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Center for Biotechnology Information, National Library of Medicine</institution>
          ,
          <addr-line>NIH, Bethesda, MD 20894</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>BioC is a simple XML data format for text, annotations, and relations. PubAnnotation is a repository of text annotations focused on the life science literature. BioCconvert, a conversion tool between BioC XML and the JSON import/export format of PubAnnotation, has been developed. As a demonstration, the Ab3P gold standard abbreviation annotations are being made available through PubAnnotation.</p>
      </abstract>
      <kwd-group>
        <kwd>BioC</kwd>
        <kwd>PubAnnotation</kwd>
        <kwd>interoperability</kwd>
        <kwd>biomedical annotations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        BioC is a simple data structure for text, annotations, and
relations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It was developed to support the BioCreative
series of workshops. It was successfully used in dedicated
BioC tracks at BioCreative IV [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and BioCreative V [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It
was also used in other tracks, such as the Comparative
Toxicogenomics Database (CTD) Curation track at
BioCreative IV [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the Chemical Disease Relation (CDR)
track at BioCreative V [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. BioC annotations are specifically
identified and labeled substrings of the original text. They do
not need to be contiguous. They occur in a passage, or
sentence, along with, or parallel to, the original text. Relations
connect an arbitrary number of annotations, or other relations,
in any way desired. The details of a relation should be
described in an accompanying key file.
      </p>
      <p>
        PubAnnotation is a repository of text annotations mainly
developed and maintained by DBCLS (Database Center for
Life Science) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It focuses on annotations to the life science
literature, particularly PubMed® abstracts and PubMed
Central® (PMC®) full text articles. PubAnnotation allows for
three types of annotations: denotations, relations, and
modifications. A denotation is an identified and labeled
portion of the original text. This is what, in other contexts, is
often simply called an annotation. A relation describes the
relationship between two denotations, as expected. A
modification changes a single denotation or relation. Supported
examples are Speculation and Negation.
      </p>
      <p>Both BioC and PubAnnotation have sizeable and growing
communities. According to Google Scholar, the original BioC
paper has 60 citations. More than 15 papers on or using BioC
appear in PubMed. The original PubAnnotation project has 8
citations. The PubAnnotation site lists 138 projects, of which
26 have been released. PubAnnotation corpora that would be
valuable to have in BioC include CoMAGC, a cancer and gene corpus,
and SPECIES800, an organism corpus. BioC corpora that
might be useful in PubAnnotation include DDIcorpus and
GeneTag. BioC tools that could be applied to PubAnnotation
corpora include abbreviation finding, NLP pipelines in C++
and Java, and a number of NER tools. Interoperability between
BioC and PubAnnotation would therefore benefit both communities.</p>
    </sec>
    <sec id="sec-2">
      <title>II. CONVERSION AND EXAMPLE</title>
      <p>PubAnnotation has a mechanism to add documents in
addition to its existing PubMed and PMC sets. Since our
example used PubMed references, no additional
PubAnnotation documents needed to be created, and this
feature of PubAnnotation is not addressed here. Only the
appropriate annotations needed to be created or interpreted.
When a PubAnnotation denotation is created, the text of the
enclosing passage is reported. Modifications are used to represent
unary BioC relations, while relations represent binary BioC
relations. Offsets were adjusted to refer to the
reported text. Lengths were used to calculate the end of a span.
Table 1 shows sample BioC XML annotations and the
corresponding PubAnnotation JSON.</p>
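      <p>The offset and length handling described above can be sketched as follows. This is a minimal illustration, not the BioCconvert source: the function names and the <monospace>text_start</monospace> parameter (the document offset at which the reported text begins) are assumptions introduced here.</p>
      <preformat><![CDATA[
```python
def location_to_span(offset, length, text_start=0):
    """Turn a BioC location (offset, length) into a begin/end span.

    text_start is a hypothetical parameter: the document offset at which
    the reported passage text begins. Spans are relative to that text.
    """
    begin = offset - text_start
    if begin < 0 or length < 0:
        raise ValueError("location lies outside the reported text")
    return {"begin": begin, "end": begin + length}


def span_to_location(span, text_start=0):
    """Inverse direction: recover a BioC offset and length from a span."""
    return {
        "offset": span["begin"] + text_start,
        "length": span["end"] - span["begin"],
    }


# Values taken from the Table 1 sample, where the passage offset is 0.
short_form = location_to_span(49, 3)
long_form = location_to_span(18, 29)
print(short_form, long_form)
```
]]></preformat>
      <p>With the Table 1 values, the short form maps to begin 49 and end 52, and the long form to begin 18 and end 47, matching the spans in the sample JSON.</p>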
      <p>
        The conversion tool (BioCconvert) is implemented in Python.
In addition to having a BioC implementation, Python ships
with a standard JSON library. As a demonstration of this tool,
the abbreviation definition corpus created to test the Ab3P
abbreviation definition identifier [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ] was added to
PubAnnotation. This gold standard corpus includes 1250
manually annotated MEDLINE records. It includes 1221
abbreviation-definition pairs. For an abbreviation definition,
both the abbreviation (short form) and its definition (long
form) are identified. This corpus was chosen as the demonstration
corpus for several reasons. The concept of an abbreviation
definition is simple and clear, so reviewing the imported
annotations for accuracy was easy. Since the relationship
between an abbreviation and its defining long form is explicit
in the corpus, importing relations could be tested in addition to
importing denotations.
      </p>
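      <p>As a rough sketch of how such a conversion can be written with only the Python standard library, the following reads the sample BioC passage with <monospace>xml.etree.ElementTree</monospace> and emits denotations with <monospace>json</monospace>. It does not use the BioC Python implementation and is not the BioCconvert code; the function name, the choice of the "type" infon as the denotation "obj", and the inclusion of a "text" field are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import json
import xml.etree.ElementTree as ET

# The BioC passage from the Table 1 sample, trimmed to its two annotations.
BIOC_PASSAGE = """
<passage>
  <infon key="type">title</infon>
  <offset>0</offset>
  <text>Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.</text>
  <annotation id="SF0">
    <infon key="ABBR">ShortForm</infon>
    <infon key="type">ABBR</infon>
    <location offset="49" length="3"/>
    <text>TAI</text>
  </annotation>
  <annotation id="LF0">
    <infon key="ABBR">LongForm</infon>
    <infon key="type">ABBR</infon>
    <location offset="18" length="29"/>
    <text>timed artificial insemination</text>
  </annotation>
</passage>
"""


def passage_to_pubannotation(passage_xml):
    """Build a PubAnnotation-style dict from one BioC passage (sketch only)."""
    passage = ET.fromstring(passage_xml.strip())
    passage_offset = int(passage.findtext("offset"))
    denotations = []
    for annotation in passage.findall("annotation"):
        infons = {i.get("key"): i.text for i in annotation.findall("infon")}
        location = annotation.find("location")
        # Spans are made relative to the start of the passage text.
        begin = int(location.get("offset")) - passage_offset
        denotations.append({
            "id": annotation.get("id"),
            "span": {"begin": begin, "end": begin + int(location.get("length"))},
            "obj": infons.get("type"),  # assumption: the "type" infon is the label
        })
    return {"text": passage.findtext("text"), "denotations": denotations}


document = passage_to_pubannotation(BIOC_PASSAGE)
print(json.dumps(document, indent=2))
```
]]></preformat>
      <p>Slicing the passage text with each resulting span returns the annotated string, which is a quick sanity check on the offsets.</p>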
      <p>Importing the corpus into PubAnnotation was tested in two
ways. First, the imported corpus was exported in the
PubAnnotation format and converted back to BioC. A stable
round trip rules out a large number of bugs. However, because
the PubAnnotation format lacks redundancy, a successful round trip
does not guarantee accuracy. Second, the developers used visual tools
to manually review articles, which confirmed that the annotations were
imported accurately. At this time, PubAnnotation does not
support multi-segment denotations. Thirteen articles include at
least one multi-segment abbreviation. These were given a single span
that covers all the individual segments. The Ab3P corpus is available
at http://pubannotation.org/projects/Ab3P-abbreviations. BioCconvert
will be available via a link at http://bioc.sourceforge.net.</p>
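      <p>The covering span used for multi-segment abbreviations can be computed as the smallest span that contains every segment. The sketch below assumes segments are given as (offset, length) pairs; the function name and the example text are hypothetical and are not taken from the Ab3P corpus.</p>
      <preformat><![CDATA[
```python
def covering_span(segments):
    """Return the smallest begin/end span containing every (offset, length) segment."""
    if not segments:
        raise ValueError("at least one segment is required")
    begin = min(offset for offset, _ in segments)
    end = max(offset + length for offset, length in segments)
    return {"begin": begin, "end": end}


# Hypothetical example: a long form split into two segments,
# "estrogen" and "receptors", with unrelated words in between.
text = "estrogen and progesterone receptors"
segments = [(0, 8), (26, 9)]
span = covering_span(segments)
print(span, repr(text[span["begin"]:span["end"]]))
```
]]></preformat>
      <p>The covering span includes the intervening text, so some precision is lost compared with the original multi-segment annotation.</p>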
    </sec>
    <sec id="sec-3">
      <title>III. DISCUSSION</title>
      <p>BioC is a desirable data-sharing format because, although
minimalist, it is also very flexible, allowing a
wide range of annotations to be represented. However, not
everything in BioC can be represented in PubAnnotation.
PubAnnotation allows for unary relations (modifications) and
binary relations (relations), while BioC allows for n-ary
relations. In practice, unary and binary relations are by far the most
common. If other relation types become more common, it is
likely that PubAnnotation will support them.</p>
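      <p>One way to make this limitation explicit in a converter is to dispatch on the number of nodes in each BioC relation and to report relations that cannot be carried over. The following is an illustrative sketch under that assumption; the function names and the relations R1 and R2 are invented for the example.</p>
      <preformat><![CDATA[
```python
def pubannotation_kind(node_refids):
    """Name the PubAnnotation construct that can hold a BioC relation, or None."""
    if len(node_refids) == 1:
        return "modification"  # unary relation
    if len(node_refids) == 2:
        return "relation"      # binary relation
    return None                # n-ary relations have no counterpart


def split_by_support(bioc_relations):
    """Separate convertible relations from those that cannot be represented.

    bioc_relations maps a relation id to the list of node refids it connects.
    """
    convertible, unsupported = {}, []
    for relation_id, node_refids in bioc_relations.items():
        kind = pubannotation_kind(node_refids)
        if kind is None:
            unsupported.append(relation_id)
        else:
            convertible[relation_id] = kind
    return convertible, unsupported


# R0 is the binary relation from the Table 1 sample; R1 and R2 are made up.
convertible, unsupported = split_by_support({
    "R0": ["LF0", "SF0"],
    "R1": ["T1"],
    "R2": ["T1", "T2", "T3"],
})
print(convertible, unsupported)
```
]]></preformat>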
      <p>BioC infons (key-value pairs) allow arbitrary additional
information about each annotation to be recorded.
Unfortunately, information beyond the object type will be lost
in PubAnnotation. Nonetheless, the annotation will still be
useful in the PubAnnotation repository. Since BioC allows
arbitrary role labels in relations, manual configuration is
required to ensure that the correct BioC information is recorded
in the PubAnnotation relation “subj,” “pred,” and “obj” fields.</p>
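      <p>The manual configuration mentioned above could take the form of a small lookup table that states, for each BioC relation type, which role labels supply the "subj" and "obj" fields and what to record as "pred". The table layout and function name below are assumptions made for illustration; only the ABBR entry reflects the Table 1 sample.</p>
      <preformat><![CDATA[
```python
# Hypothetical configuration: for each BioC relation type, which role labels
# fill the PubAnnotation "subj" and "obj" fields, and what "pred" to record.
ROLE_CONFIG = {
    "ABBR": {"subj_role": "LongForm", "obj_role": "ShortForm", "pred": "ShortForm"},
}


def relation_to_pubannotation(relation_id, relation_type, nodes, config=ROLE_CONFIG):
    """Map a binary BioC relation to subj/pred/obj using the role configuration.

    nodes is a list of (refid, role) pairs taken from the BioC relation.
    """
    if relation_type not in config:
        raise KeyError(f"no role configuration for relation type {relation_type!r}")
    rule = config[relation_type]
    refid_by_role = {role: refid for refid, role in nodes}
    return {
        "id": relation_id,
        "pred": rule["pred"],
        "subj": refid_by_role[rule["subj_role"]],
        "obj": refid_by_role[rule["obj_role"]],
    }


# The relation from the Table 1 sample.
relation = relation_to_pubannotation(
    "R0", "ABBR", [("LF0", "LongForm"), ("SF0", "ShortForm")]
)
print(relation)
```
]]></preformat>
      <p>Keeping the mapping in data rather than in code means a new corpus with different role labels needs only a new configuration entry.</p>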
      <p>Although BioCconvert is intended to be general purpose,
it has been tested on only one corpus, so it is likely task
specific in unintended and undetected ways. Porting
additional annotation collections between BioC and
PubAnnotation will expose such deficiencies, if they exist,
and allow them to be corrected.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. CONCLUSION</title>
      <p>With the creation of BioCconvert, one can now convert
between BioC XML and PubAnnotation JSON. It is possible
for BioC tools to be applied to any of the annotations available
from PubAnnotation. Conversely, annotations available in
BioC can be shared via PubAnnotation.</p>
      <table-wrap id="tab-1-json">
        <label>Table 1 (PubAnnotation JSON)</label>
        <caption>
          <p>Sample PubAnnotation JSON for PubMed record 12018411.</p>
        </caption>
        <preformat>{ "denotations": [
    { "span": { "begin": 49, "end": 52 },
      "obj": "ABBR",
      "id": "SF0" },
    { "span": { "begin": 18, "end": 47 },
      "obj": "ABBR",
      "id": "LF0" }
  ],
  "target": "http://pubannotation.org/docs/sourcedb/PubMed/sourceid/12018411",
  "sourceid": "12018411",
  "sourcedb": "PubMed",
  "relations": [
    { "pred": "ShortForm",
      "obj": "SF0",
      "subj": "LF0",
      "id": "R0" }
  ]
}</preformat>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Comeau</surname>, <given-names>D. C.</given-names></string-name>,
          <string-name><surname>Islamaj Dogan</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Ciccarese</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Cohen</surname>, <given-names>K. B.</given-names></string-name>,
          <string-name><surname>Krallinger</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Leitner</surname>, <given-names>F.</given-names></string-name>, . . .
          <string-name><surname>Wilbur</surname>, <given-names>W. J.</given-names></string-name>
          <article-title>BioC: a minimalist approach to interoperability for biomedical text processing</article-title>
          .
          <source>Database (Oxford)</source>
          ,
          <year>2013</year>
          , bat064. doi:10.1093/database/bat064.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Comeau</surname>
            ,
            <given-names>D. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batista-Navarro</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogan</surname>
            ,
            <given-names>R. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yepes</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khare</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , . . .
          <string-name>
            <surname>Wilbur</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          <article-title>BioC interoperability track overview</article-title>
          .
          <source>Database (Oxford)</source>
          ,
          <year>2014</year>
          . doi:10.1093/database/bau053.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Islamaj Doğan</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Chatr-aryamontri</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Tyers</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Wilbur</surname>, <given-names>W. J.</given-names></string-name>, &amp;
          <string-name><surname>Comeau</surname>, <given-names>D. C.</given-names></string-name>
          <article-title>Overview of BioCreative V BioC Track</article-title>
          .
          <source>Paper presented at the Fifth BioCreative Challenge Evaluation Workshop</source>
          , Seville, Spain,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Wiegers</surname>, <given-names>T. C.</given-names></string-name>,
          <string-name><surname>Davis</surname>, <given-names>A. P.</given-names></string-name>, &amp;
          <string-name><surname>Mattingly</surname>, <given-names>C. J.</given-names></string-name>
          <article-title>Web services-based text-mining demonstrates broad impacts for interoperability and process simplification</article-title>
          .
          <source>Database (Oxford)</source>
          ,
          <year>2014</year>
          . doi:10.1093/database/bau050.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          . .
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <article-title>Overview of the BioCreative V Chemical Disease Relation (CDR) Task</article-title>
          . Paper presented at the Fifth BioCreative Challenge Evaluation Workshop, Seville, Spain,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Kim</surname>, <given-names>J.-D.</given-names></string-name>
          , &amp;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>PubAnnotation: a persistent and sharable corpus and annotation repository</article-title>
          .
          <source>Paper presented at the Proceedings of the 2012 Workshop on Biomedical Natural Language Processing</source>
          , Montreal, Canada,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Sohn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Comeau</surname>
            ,
            <given-names>D. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wilbur</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          <article-title>Abbreviation definition identification based on automatic precision estimates</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          , 402. doi:10.1186/1471-2105-9-402,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Islamaj Doğan</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Comeau</surname>, <given-names>D. C.</given-names></string-name>,
          <string-name><surname>Yeganova</surname>, <given-names>L.</given-names></string-name>, &amp;
          <string-name><surname>Wilbur</surname>, <given-names>W. J.</given-names></string-name>
          <article-title>Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          ,
          <year>2014</year>
          , bau044. http://doi.org/10.1093/database/bau044
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
  <floats-group>
    <table-wrap id="tab-1-bioc">
      <label>Table 1 (BioC XML)</label>
      <caption>
        <p>Sample BioC XML passage with two annotations and one relation.</p>
      </caption>
      <preformat>&lt;passage&gt;
  &lt;infon key="type"&gt;title&lt;/infon&gt;
  &lt;offset&gt;0&lt;/offset&gt;
  &lt;text&gt;Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.&lt;/text&gt;
  &lt;annotation id="SF0"&gt;
    &lt;infon key="ABBR"&gt;ShortForm&lt;/infon&gt;
    &lt;infon key="type"&gt;ABBR&lt;/infon&gt;
    &lt;location offset="49" length="3"/&gt;
    &lt;text&gt;TAI&lt;/text&gt;
  &lt;/annotation&gt;
  &lt;annotation id="LF0"&gt;
    &lt;infon key="ABBR"&gt;LongForm&lt;/infon&gt;
    &lt;infon key="type"&gt;ABBR&lt;/infon&gt;
    &lt;location offset="18" length="29"/&gt;
    &lt;text&gt;timed artificial insemination&lt;/text&gt;
  &lt;/annotation&gt;
  &lt;relation id="R0"&gt;
    &lt;infon key="type"&gt;ABBR&lt;/infon&gt;
    &lt;node refid="LF0" role="LongForm"/&gt;
    &lt;node refid="SF0" role="ShortForm"/&gt;
  &lt;/relation&gt;
&lt;/passage&gt;</preformat>
    </table-wrap>
  </floats-group>
</article>