Bluima: a UIMA-based NLP Toolkit for Neuroscience

Renaud Richardet, Jean-Cédric Chappelier, Martin Telefont
Blue Brain Project, EPFL, 1015 Lausanne, Switzerland
renaud.richardet@epfl.ch

Abstract. This paper describes Bluima, a natural language processing (NLP) pipeline focusing on the extraction of neuroscientific content and based on the UIMA framework. Bluima builds upon models from biomedical NLP (BioNLP), like specialized tokenizers and lemmatizers. It adds further models and tools specific to neuroscience (e.g. named entity recognizers for neuron or brain region mentions) and provides collection readers for neuroscientific corpora. Two novel UIMA components are proposed: the first allows configuring and instantiating UIMA pipelines using a simple scripting language, enabling non-UIMA experts to design and run UIMA pipelines. The second is a common analysis structure (CAS) store based on MongoDB, to perform incremental annotation of large document corpora.

Keywords: UIMA, natural language processing, NLP, neuroinformatics, NoSQL

1 Introduction

Bluima started as an effort to develop a high-performance natural language processing (NLP) toolkit for neuroscience. The goal was to extract structured knowledge from the biomedical literature (PubMed¹), in order to help neuroscientists gather data to specify parameters for their models. In particular, the focus was set on extracting entities that are specific to neuroscience (like brain regions and neurons) and that are not yet covered by existing text processing systems.

After careful evaluation of different NLP frameworks, the UIMA software system was selected for its open standards, its performance and stability, and its usage in several other biomedical NLP (bioNLP) projects, e.g. JulieLab [11], ClearTK [22], DKPro [6], cTAKES [28], ccp-nlp, U-Compare [15], SciKnowMine [26], and Argo [25]. Initial development proceeded quickly, and several existing bioNLP models and UIMA components could rapidly be reused or integrated into UIMA without the need to modify its core system, as presented in Section 2.1.

Once the initial components were in place, an experimentation phase started, in which different pipelines were created, each with different components and parameters. Pipeline definition in verbose XML was greatly improved by the use of UIMAFit [21] (which defines pipelines in compact Java code), but remained problematic, as it requires some Java knowledge and a recompilation for each component or parameter change. To allow for more agile prototyping, especially by non-specialist end users, a pipeline scripting language was created. It is described in Section 2.2.

Another concern was the incremental annotation of large document corpora, for example when running an initial pre-processing pipeline on several million documents that are then annotated again at a later time. The initial strategy was to store the documents on disk and to overwrite them each time they were incrementally annotated. Eventually, a CAS store module was developed to provide a stable and scalable strategy for incremental annotation, as described in Section 2.3. Finally, Section 3 presents two case studies illustrating the scripting language and evaluating the performance of the CAS store against existing serialization formats.

¹ http://www.ncbi.nlm.nih.gov/pubmed

2 Bluima Components

Bluima contains several UIMA modules to read neuroscientific corpora, perform preprocessing, create simple configuration files to run pipelines, and persist documents to disk.
2.1 UIMA Modules

Bluima's typesystem builds upon the typesystem from JulieLab [10], which was chosen for its strong biomedical orientation and its clean architecture. Bluima's typesystem adds neuroscientific annotations, like CellType, BrainRegion, etc.

Bluima includes several collection readers for selected neuroscience corpora, like PubMed XML dumps, PubMed Central NXML files, the BioNLP 2011 GENIA Event Extraction corpus [24], the BioCreative 2 annotated corpus [16], the GENIA annotated corpus [14], and the WhiteText brain regions corpus [8].

A PDF reader was developed to provide robust and precise text extraction from scientific articles in PDF format. The PDF reader performs content correction and cleanup, like dehyphenation, removal of ligatures, glyph mapping correction, table detection, and removal of non-informative footers and headers.

For pre-processing, the OpenNLP wrappers developed by JulieLab for sentence segmentation, word tokenization and part-of-speech tagging [31] were used and updated to UIMAFit. Lemmatization is performed by the domain-specific tool BioLemmatizer [19]. Abbreviation recognition (the task of identifying abbreviations in text) is performed by BioADI, a supervised machine learning model trained on the BioADI corpus [17].

Bluima uses UIMA's ConceptMapper [29] to build lexicon-based NERs from several neuroscientific lexica and ontologies (Table 1). These lexica and ontologies were either developed in-house or imported from existing sources. Bluima also wraps several machine learning-based NERs, like OSCAR4 [13] (chemicals, reactions), Linnaeus [9] (species), BANNER [18] (genes and proteins), and Gimli [5] (proteins).

Name          | Source            | Scope                                  | # forms
Age           | BlueBrain         | age of organism, developmental stage   | 138
Sex           | BlueBrain         | sex (male, female) and variants        | 10
Method        | BlueBrain         | experimental methods in neuroscience   | 43
Organism      | BlueBrain         | organisms used in neuroscience         | 121
Cell          | BlueBrain         | cell, sub-cell and region              | 862
Ion channel   | Channelpedia [27] | ion channels                           | 868
Uniprot       | Uniprot [1]       | genes and proteins                     | 143,757
Biolexicon    | Biolexicon [30]   | unified lexicon of biomedical terms    | 2.2 Mio
Verbs         | Biolexicon        | verbs extracted from the Biolexicon    | 5,038
Cell ontology | OBO [2]           | cell types (prokaryotic to mammalian)  | 3,564
Disease ont.  | OBO [23]          | human disease ontology                 | 24,613
Protein ont.  | OBO [20]          | protein-related entities               | 29,198
Brain region  | Neuronames [3]    | hierarchy of brain regions             | 8,211
Wordnet       | Wordnet [7]       | general English                        | 155,287
NIFSTD        | NIF [12,4]        | neuroscience ontology                  | 16,896

Table 1. Lexica and ontologies used for lexical matching.

2.2 Pipeline Scripting Language

There are several approaches² to writing and running UIMA pipelines (see Table 2). All Bluima components were initially written in Java with the UIMAFit library, which allows for compact code. To improve the design of and experimentation with UIMA pipelines, and to enable researchers without Java or UIMA knowledge to easily design and run such pipelines, a minimalistic scripting (domain-specific) language was developed, allowing UIMA pipelines to be configured with text files in a human-readable format (Table 3).

Tool              | Advantages     | Disadvantages
UIMA GUI          | GUI            | minimalistic UI, cannot reuse pipelines
XML descriptor    | typed (schema) | very verbose
raw UIMA Java API | typed          | verbose, requires writing and compiling Java
UIMAFit           | compact, typed | requires writing and compiling Java code

Table 2. Different approaches to writing and running UIMA pipelines.

A pipeline script begins with the definition of a collection reader (starting with cr:), followed by several annotation engines (starting with ae:)³. Parameter specification starts with a space, followed by the parameter name, a colon and its value. The scripting language also supports the embedding of inline Python and Java code, the reuse of a portion of a pipeline with include statements, and variable substitution similar to shell scripts.

² Other interesting solutions exist (e.g. IBM LanguageWare, Argo), but are not open source.
³ If no package namespace is specified, Bluima loads Reader and Annotator classes from the default namespace.
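As a minimal sketch of this syntax (the component names, parameter names and $-variables are taken from the larger example in Table 3; the include statement is hypothetical, shown only to suggest how a shared pipeline fragment could be reused):

# collection reader; $1 is substituted with the first command-line argument
cr: FromFilelistReader
 inputFile: $1

# hypothetical include statement, reusing a shared preprocessing fragment
include: preprocessing.pipeline

# annotation engine, configured with an explicit model file
ae: SentenceAnnotator
 modelFile: $ROOT/modules/julielab_opennlp/models/sentence/PennBio.bin.gz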
Extensive documentation (in particular snippets of scripts) is automatically generated for all components, using the JavaDoc and the UIMAFit annotations.

2.3 CAS Store

A CAS store was developed to persist annotated documents, resume their processing and add new annotations to them. This CAS store was motivated by the common use case of repeatedly and incrementally processing the same documents with different UIMA pipelines, where some pipeline steps are duplicated among the runs. For example, when performing resource-intensive operations (like extracting the text from full-text PDF articles, or performing syntactic parsing), one might want to perform these preliminary operations once, store the results, and subsequently perform different experiments with different UIMA modules and parameters. The CAS store thus makes it possible to perform the preprocessing only once, persist the annotated documents, and then run the various experiments in parallel.

MongoDB⁴ was selected as the datastore backend. MongoDB is a scalable, high-performance, open-source, schema-free (NoSQL), document-oriented database. No schema is required on the database side, since the UIMA typesystem acts as a schema and the data is validated on-the-fly by the module. Every CAS is stored as a MongoDB document, along with its annotations. UIMA annotations and their features are explicitly mapped to MongoDB fields, using a simple and declarative language. For example, a Protein annotation is mapped to a prot field in MongoDB. The mappings are used when persisting to and loading from the database. As of this writing, the mappings are declared in Java source files; in future versions, we plan to store them directly in MongoDB to improve flexibility. Persistence of complex typesystems has not been implemented yet, but could easily be added in the future.

Currently, the following UIMA components are available for the CAS store:

– MongoCollectionReader reads CASes from a MongoDB collection; optionally, a (filter) query can be specified;
– RegexMongoCollectionReader is similar to MongoCollectionReader, but allows specifying a query with a regular expression on a specific field;
– MongoWriter persists new UIMA CASes into MongoDB documents;
– MongoUpdateWriter persists new annotations into an existing document;
– MongoCollectionRemover removes selected annotations from a MongoDB collection.

With the above components, it is possible within a single pipeline to read an existing collection of annotated documents, perform some further processing, add more annotations, and store these annotations back into the same MongoDB documents, as sketched below.

⁴ http://www.mongodb.org/
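A minimal sketch of such an incremental pipeline script, assuming the components listed above (the parameter names host, db and collection are hypothetical, as is the StopwordsAnnotator class name; Section 3.2 uses a Stopwords annotator for exactly this purpose):

# read previously annotated documents back from the CAS store
# (parameter names are hypothetical)
cr: MongoCollectionReader
 host: localhost
 db: bluima
 collection: pubmed

# add one more annotation layer on top of the stored annotations
# (hypothetical class name for the Stopwords annotator of Section 3.2)
ae: StopwordsAnnotator

# persist only the new annotations into the same MongoDB documents
ae: MongoUpdateWriter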
3 Case Studies and Evaluation

A first experiment, illustrating the scripting language, was conducted on a large dataset of full-text biomedical articles. A second, simulated experiment evaluates the performance of the MongoDB CAS store against existing serialization formats.

3.1 Scripting and Scale-Out

Bluima was used to extract brain region mention co-occurrences from scientific articles in PDF. The pipeline script (Table 3) was created and tested on a development laptop. Scale-out was performed on a 12-node (144-core) cluster managed by SLURM (Simple Linux Utility for Resource Management). The 383,795 PDFs were partitioned into 767 jobs. Each job was instantiated with the same pipeline script, using different input and output parameters. The processing completed in 809 minutes (≈ 8 PDF/s).

# collection reader configured with a list of files (provided as external params)
cr: FromFilelistReader
 inputFile: $1

# processes the content of the PDFs
ae: ch.epfl.bbp.uima.pdf.cr.PdfCollectionAnnotator

# tokenization and lemmatization
ae: SentenceAnnotator
 modelFile: $ROOT/modules/julielab_opennlp/models/sentence/PennBio.bin.gz
ae: TokenAnnotator
 modelFile: $ROOT/modules/julielab_opennlp/models/token/Genia.bin.gz
ae: BlueBioLemmatizer

# lexical NERs, instantiated with some helper java code
ae_java: ch.epfl.bbp.uima.LexicaHelper.getConceptMapper("/bbp_onto/brainregion")
ae_java: ch.epfl.bbp.uima.LexicaHelper.getConceptMapper("/bams/bams")

# removes duplicate annotations and extracts collocated brainregion annotations
ae: DeduplicatorAnnotator
 annotationClass: ch.epfl.bbp.uima.types.BrainRegionDictTerm
ae: ExtractBrainregionsCoocurrences
 outputDirectory: $2

Table 3. Pipeline script for the extraction of brain region mention co-occurrences from PDF documents.

3.2 MongoDB CAS Store

The MongoDB CAS store (MCS) was evaluated against 3 other available serialization formats (XCAS, XMI and ZIPXMI). For each, 3 settings were evaluated: writes (CASes are persisted to disk), reads (CASes are loaded from their persisted states), and incremental (CASes are first read from their persisted states, then further processed, and finally persisted again to disk). Writes and reads were performed on a random sample of 500,000 PubMed abstracts, annotated with all available Bluima NERs. Incremental annotation was performed on a random sample of 5,000 PubMed abstracts, incrementally annotated with the Stopwords annotator. Processing time and disk space were measured on a commodity laptop (4 cores, 8 GB RAM).

Format  | Write [s] | Write size [MB] | Read [s] | Incremental [s]
XCAS    | 4014      | 41718           | 3407     | 31.7
XMI     | 4479      | 32236           | 3090     | 42.2
ZIPXMI  | 5033      | 4677            | 2790     | 43.6
MongoDB | 3281      | 16724           | 730      | 22.5

Fig. 1. Performance evaluation of the MongoDB CAS store against 3 other serialization formats.

In terms of speed, the MCS significantly outperforms the other formats, especially for reads (Figure 1). The MCS disk size is significantly smaller than for the XCAS and XMI formats, but almost 4 times larger than for the compressed ZIPXMI format. Incremental annotation is significantly faster with MongoDB and does not require duplicating or overwriting files, unlike the other serialization formats. The MCS could be scaled up in a cluster setup, or by using solid-state drives (SSDs). Writes could probably be improved by turning MongoDB's "safe mode" option off. Furthermore, by adding indexes, the MCS can act as a searchable annotation database.
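For example, a hypothetical script could use the RegexMongoCollectionReader of Section 2.3 to query the store like an annotation database; the field and regex parameter names are assumptions, while prot is the mapped Protein field described in Section 2.3:

# select only documents whose mapped protein field matches a pattern
# (parameter names are hypothetical)
cr: RegexMongoCollectionReader
 field: prot
 regex: .*channel.*

# ... further annotation engines then operate on this matching subset only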
4 Conclusions and Future Work

In the process of developing Bluima, a toolkit for neuroscientific NLP, we integrated and wrapped several specialized resources to process neuroscientific articles. We also created two UIMA modules (a scripting language and a CAS store). These additions proved to be very effective in practice and allowed us to leverage UIMA, an enterprise-grade framework, while at the same time allowing agile development and deployment of NLP pipelines. In the future, we will open-source Bluima and add more models for NER and relationship extraction. We also plan to ease the deployment of Bluima (and its scripting language) on a Hadoop cluster.

References

1. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M.: The universal protein resource (UniProt). Nucleic Acids Research 33(suppl 1), D154–D159 (2005)
2. Bard, J., Rhee, S.Y., Ashburner, M.: An ontology for cell types. Genome Biology 6(2) (2005)
3. Bowden, D., Dubach, M.: NeuroNames 2002. Neuroinformatics 1(1), 43–59 (2003)
4. Bug, W.J., Ascoli, G.A., Grethe, J.S., Gupta, A., Fennema-Notestine, C., Laird, A.R., Larson, S.D., Rubin, D., Shepherd, G.M., Turner, J.A.: The NIFSTD and BIRNLex vocabularies: building comprehensive ontologies for neuroscience. Neuroinformatics 6(3), 175–194 (2008)
5. Campos, D., Matos, S., Oliveira, J.L.: Gimli: open source and high-performance biomedical name recognition. BMC Bioinformatics 14(1), 54 (Feb 2013)
6. De Castilho, R.E., Gurevych, I.: DKPro-UGD: a flexible data-cleansing approach to processing user-generated discourse. In: Online proceedings of the First French-speaking meeting around the framework Apache UIMA, LINA CNRS UMR (2009)
7. Fellbaum, C.: WordNet. Theory and Applications of Ontology: Computer Applications, p. 231–243 (2010)
8. French, L., Lane, S., Xu, L., Pavlidis, P.: Automated recognition of brain region mentions in neuroscience literature. Frontiers in Neuroinformatics 3 (Sep 2009)
9. Gerner, M., Nenadic, G., Bergman, C.: Linnaeus: a species name identification system for biomedical literature. BMC Bioinformatics 11(1), 85 (2010)
10. Hahn, U., Buyko, E., Tomanek, K., Piao, S., Mcnaught, J., Tsuruoka, Y., Ananiadou, S.: An Annotation Type System for a Data-Driven NLP Pipeline (2007)
11. Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K., Wermter, J.: An overview of JCoRe, the JULIE lab UIMA component repository. In: Proceedings of the LREC, vol. 8, p. 1–7 (2008)
12. Imam, F.T., Larson, S.D., Grethe, J.S., Gupta, A., Bandrowski, A., Martone, M.E.: NIFSTD and NeuroLex: a comprehensive neuroscience ontology development based on multiple biomedical ontologies and community involvement (2011)
13. Jessop, D., et al.: OSCAR4: a flexible architecture for chemical text-mining. Journal of Cheminformatics 3(1), 41 (Oct 2011)
14. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 19, i180–i182 (Jul 2003)
15. Kontonatsios, G., Korkontzelos, I., Kolluru, B., Thompson, P., Ananiadou, S.: Deploying and sharing U-Compare workflows as web services. Journal of Biomedical Semantics 4, 7 (2013)
16. Krallinger, M., Morgan, A., Smith, L., Leitner, F., Tanabe, L., Wilbur, J., Hirschman, L., Valencia, A.: Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge. Genome Biology 9(Suppl 2), S1 (2008)
17. Kuo, C.J., et al.: BioADI: a machine learning approach to identifying abbreviations and definitions in biological literature. BMC Bioinformatics 10(Suppl 15), S7 (Dec 2009)
18. Leaman, R., Gonzalez, G., et al.: BANNER: an executable survey of advances in biomedical named entity recognition. In: Pacific Symposium on Biocomputing, vol. 13, p. 652–663 (2008)
19. Liu, H., et al.: BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics 3(1), 3 (Apr 2012)
20. Natale, D.A., Arighi, C.N., Barker, W.C., Blake, J.A., Bult, C.J., Caudy, M., Drabkin, H.J., D'Eustachio, P., Evsikov, A.V., Huang, H., Nchoutmboube, J., Roberts, N.V., Smith, B., Zhang, J., Wu, C.H.: The protein ontology: a structured representation of protein forms and complexes. Nucleic Acids Research 39(Database issue), D539–545 (Jan 2011)
21. Ogren, P.V., Bethard, S.J.: Building test suites for UIMA components. NAACL HLT 2009, p. 1 (2009)
22. Ogren, P.V., Wetzler, P.G., Bethard, S.J.: ClearTK: a UIMA toolkit for statistical natural language processing. Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP, p. 32 (2008)
23. Osborne, J., Flatow, J., Holko, M., Lin, S.M., Kibbe, W.A., Zhu, L.J., Danila, M.I., Feng, G., Chisholm, R.L.: Annotating the human genome with disease ontology. BMC Genomics 10(Suppl 1), S6 (Jul 2009)
24. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., Sobral, B., Tsujii, J., Ananiadou, S.: Overview of the ID, EPI and REL tasks of BioNLP shared task 2011. BMC Bioinformatics 13(Suppl 11), S2 (Jun 2012)
25. Rak, R., Rowley, A., Black, W., Ananiadou, S.: Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation 2012 (2012)
26. Ramakrishnan, C., Baumgartner Jr, W.A., Blake, J.A., Burns, G.A., Cohen, K.B., Drabkin, H., Eppig, J., Hovy, E., Hsu, C.N., Hunter, L.E.: Building the scientific knowledge mine (SciKnowMine): a community-driven framework for text mining tools in direct service to biocuration. In: Language Resources and Evaluation, Malta (2010)
27. Ranjan, R., Khazen, G., Gambazzi, L., Ramaswamy, S., Hill, S.L., Schürmann, F., Markram, H.: Channelpedia: an integrative and interactive database for ion channels. Frontiers in Neuroinformatics 5 (2011)
28. Savova, G.K., Masanz, J.J., Ogren, P.V., Zheng, J., Sohn, S., Kipper-Schuler, K.C., Chute, C.G.: Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17(5), 507–513 (2010)
29. Tanenblatt, M.A., Coden, A., Sominsky, I.L.: The ConceptMapper approach to named entity recognition. In: LREC (2010)
30. Thompson, P., et al.: The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics 12(1), 397 (2011)
31. Tomanek, K., Wermter, J., Hahn, U.: A reappraisal of sentence and token splitting for life sciences documents. Studies in Health Technology and Informatics 129(Pt 1), 524–528 (2006)