BioCconvert: A Conversion Tool Between BioC and PubAnnotation

Donald C. Comeau, Rezarta Islamaj Doğan, Sun Kim, Chih-Hsuan Wei, W. John Wilbur and Zhiyong Lu
National Center for Biotechnology Information
National Library of Medicine, NIH
Bethesda, MD 20894, USA
comeau@ncbi.nlm.nih.gov

Abstract: BioC is a simple XML data format for text, annotations, and relations. PubAnnotation is a repository of text annotations focused on the life science literature. A conversion tool, BioCconvert, has been developed between BioC XML and the JSON import/export format of PubAnnotation. As a demonstration, the Ab3P gold standard abbreviation annotations are being made available through PubAnnotation.

Keywords: BioC, PubAnnotation, interoperability, biomedical annotations

I. INTRODUCTION

BioC is a simple data structure for text, annotations, and relations [1]. It was developed to support the BioCreative series of workshops. It was successfully used in dedicated BioC tracks at BioCreative IV [2] and BioCreative V [3]. It was also used in other tracks such as the Comparative Toxicogenomics Database (CTD) Curation track at BioCreative IV [4] and the Chemical Disease Relation (CDR) track at BioCreative V [5]. BioC annotations are specific, identified, and labeled substrings of the original text. They do not need to be contiguous. They occur in a passage, or sentence, alongside the original text. Relations connect an arbitrary number of annotations, or other relations, in any way desired. The details of a relationship should be described in an accompanying key file.

PubAnnotation is a repository of text annotations mainly developed and maintained by DBCLS (Database Center for Life Science) [6]. It focuses on annotations to the life science literature, particularly PubMed® abstracts and PubMed Central® (PMC®) full text articles. PubAnnotation allows for three types of annotations: denotations, relations, and modifications. A denotation is an identified and labeled portion of the original text. This is what, in other contexts, is often simply called an annotation. A relation describes the relationship between two denotations. A modification changes a single denotation or relation. Supported examples are Speculation and Negation.

Both BioC and PubAnnotation have sizeable and growing communities. According to Google Scholar, the original BioC paper has 60 citations. More than 15 papers on or using BioC appear in PubMed. The original PubAnnotation project has 8 citations. The PubAnnotation site lists 138 projects, of which 26 have been released. PubAnnotation corpora that would be valuable to have in BioC include CoMAGC, a cancer and gene corpus, and SPECIES800, an organism corpus. BioC corpora that might be useful in PubAnnotation include DDIcorpus and GeneTag. BioC tools that could be applied to PubAnnotation corpora include abbreviation finding, NLP pipelines in C++ and Java, and a number of NER tools. The benefits of interoperability between BioC and PubAnnotation are clear.

II. CONVERSION AND EXAMPLE

PubAnnotation has a mechanism to add documents in addition to its existing PubMed and PMC sets. Since our example used PubMed references, no additional PubAnnotation documents needed to be created, and this feature of PubAnnotation is not addressed. Only the appropriate annotations needed to be created or interpreted. When a PubAnnotation denotation is created, the text of the enclosing passage is reported. Modifications are used to represent unary BioC relations, while relations represent binary BioC relations. Offsets were adjusted to refer to the reported text. Lengths were used to calculate the end of a span. Table I shows sample BioC XML annotations and the corresponding PubAnnotation JSON.

The conversion tool (BioCconvert) is implemented in Python. In addition to having a BioC implementation, Python ships with a standard JSON library. As a demonstration of this tool, the abbreviation definition corpus created to test the Ab3P abbreviation definition identifier [7,8] was added to PubAnnotation. This gold standard corpus includes 1250 manually annotated MEDLINE records. It includes 1221 abbreviation-definition pairs. For an abbreviation definition, both the abbreviation (short form) and its definition (long form) are identified. There are a number of reasons this corpus was chosen as the demo corpus. The concept of an abbreviation definition is very simple and clear, so reviewing the imported annotations for accuracy was easy. Since the relationship between an abbreviation and its defining long form is explicit in the corpus, importing relations could be tested in addition to just importing denotations.

Importing the corpus into PubAnnotation was tested in two ways. First, the imported corpus was exported in the PubAnnotation format and converted back to BioC. A stable round trip rules out a large number of bugs. However, because the PubAnnotation format lacks redundancy, this round trip does not guarantee accuracy. Second, the developers used visual tools to manually review articles. This ensured the annotations were imported accurately. At this time, PubAnnotation does not support multi-segment denotations. Thirteen articles include at least one multi-segment abbreviation. These were given a span that covers all the individual spans. The Ab3P corpus is available at http://pubannotation.org/projects/Ab3P-abbreviations. BioCconvert will be available via a link at http://bioc.sourceforge.net.

III. DISCUSSION

BioC is a desirable data-sharing format because, while being a minimalistic approach, it is also very flexible, allowing a wide range of annotations to be represented. However, not everything in BioC can be represented in PubAnnotation. PubAnnotation allows for unary relations (modification) and binary relations (relation), while BioC allows for n-ary relations. However, unary and binary are by far the most common. If other relation types become more common, it is likely that PubAnnotation will support them.

BioC infons (key-value pairs) allow arbitrary additional information about each annotation to be recorded. Unfortunately, information beyond the object type will be lost in PubAnnotation. Nonetheless, the annotation will still be useful in the PubAnnotation repository. Since BioC allows arbitrary role labels in relations, manual configuration is required to ensure that the correct BioC information is recorded in the PubAnnotation relation "subj," "pred," and "obj" fields.

While the intent of BioCconvert is to be general purpose, it has been tested on only one corpus, so it is likely task specific in unintended and undetected ways. Porting additional annotation collections between BioC and PubAnnotation will identify these deficiencies, if they exist, and allow them to be corrected.

IV. CONCLUSION

With the creation of BioCconvert, one can now convert between BioC XML and PubAnnotation JSON. It is possible for BioC tools to be applied to any of the annotations available from PubAnnotation. Conversely, annotations available in BioC can be shared via PubAnnotation.

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.

REFERENCES

[1] Comeau, D. C., Islamaj Dogan, R., Ciccarese, P., Cohen, K. B., Krallinger, M., Leitner, F., . . . Wilbur, W. J. BioC: a minimalist approach to interoperability for biomedical text processing. Database (Oxford), 2013, bat064. doi:10.1093/database/bat064.
[2] Comeau, D. C., Batista-Navarro, R. T., Dai, H. J., Dogan, R. I., Yepes, A. J., Khare, R., . . . Wilbur, W. J. BioC interoperability track overview. Database (Oxford), 2014. doi:10.1093/database/bau053.
[3] Kim, S., Islamaj Doğan, R., Chatr-aryamontri, A., Tyers, M., Wilbur, W. J., & Comeau, D. C. Overview of BioCreative V BioC Track. Paper presented at the Fifth BioCreative Challenge Evaluation Workshop, Seville, Spain, 2015.
[4] Wiegers, T. C., Davis, A. P., & Mattingly, C. J. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database (Oxford), 2014. doi:10.1093/database/bau050.
[5] Wei, C.-H., Peng, Y., Leaman, R., Davis, A. P., Mattingly, C. J., Li, J., . . . Lu, Z. Overview of the BioCreative V Chemical Disease Relation (CDR) Task. Paper presented at the Fifth BioCreative Challenge Evaluation Workshop, Seville, Spain, 2015.
[6] Kim, J.-D., & Wang, Y. PubAnnotation: a persistent and sharable corpus and annotation repository. Paper presented at the Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, Montreal, Canada, 2012.
[7] Sohn, S., Comeau, D. C., Kim, W., & Wilbur, W. J. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 2008, 9, 402. doi:10.1186/1471-2105-9-402.
[8] Islamaj Doğan, R., Comeau, D. C., Yeganova, L., & Wilbur, W. J. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database: The Journal of Biological Databases and Curation, 2014, bau044. doi:10.1093/database/bau044.

TABLE I. EQUIVALENT BIOC XML AND PUBANNOTATION JSON FOR THE SAME TEXT AND ABBREVIATION DEFINITION ANNOTATIONS. THE COLOR CODED SECTIONS INDICATE EQUIVALENT INFORMATION. YELLOW: TEXT, RED: ID, PURPLE: OFFSET, GREEN: LENGTH, OR END, OF ANNOTATION.

BioC XML (element content only):

    title
    0
    Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
    ShortForm
    ABBR
    TAI
    LongForm
    ABBR
    timed artificial insemination
    ABBR

PubAnnotation JSON:

    [
      {
        "denotations": [
          { "span": { "begin": 49, "end": 52 }, "obj": "ABBR", "id": "SF0" },
          { "span": { "begin": 18, "end": 47 }, "obj": "ABBR", "id": "LF0" }
        ],
        "target": "http://pubannotation.org/docs/sourcedb/PubMed/sourceid/12018411",
        "sourceid": "12018411",
        "sourcedb": "PubMed",
        "relations": [
          { "pred": "ShortForm", "obj": "SF0", "subj": "LF0", "id": "R0" }
        ],
        "project": "Ab3P_abbreviations",
        "text": "Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum."
      }
    ]