<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Synergies between Biocuration and Ontology Alignment Automation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Dearing</string-name>
          <email>ddearing@stottlerhenke.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Terrance Goan</string-name>
          <email>goan@stottlerhenke.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Stottler Henke Associates, Inc.</institution>
<addr-line>1107 NE 45</addr-line>
<country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Researchers have long recognized the value trapped in natural language publications and have continued to advance the development of ontologies that can help unleash this value. Among these advances are efforts to apply Natural Language Processing (NLP) techniques to streamline the labor-intensive process of scientific literature curation, which encodes relevant information in a form that is accessible to both humans and computers. In this paper, we report on our initial efforts to improve ontology alignment within the context of scientific literature curation by exploiting a large corpus of annotated PubMed abstracts. We employ an ensemble learning approach that augments a collection of publicly available ontology matching systems with a matching technique that leverages word embeddings learned from this corpus to more successfully match the concepts of two disease ontologies (MeSH and OMIM). Our experiments show that word embedding-based similarity scores contribute value beyond traditional matching systems: the performance of an ensemble trained on a small number of manually reviewed mappings improves with their inclusion.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Matching Ensembles</kwd>
        <kwd>Word Embeddings</kwd>
        <kwd>Biocuration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Technological advancements have given rise to an explosion in the rate that biomedical
data is generated. The incredible volume of data now far exceeds the ability of
researchers to capitalize on it. This is due, in large part, to the vagaries of the natural
languages in which that data is published for consumption by human readers. The wide
variety of lexical forms employed in the research literature presents persistent challenges
for both humans and computers in finding, assessing, and assimilating relevant data.</p>
      <p>The research community has long recognized the value trapped in natural language
publications and has continued to advance the development of ontologies that can
mitigate the challenges posed by natural language. Today, ontologies are a critical
foundation for emerging technologies that seek to better inform and accelerate biomedical
research. Notable among recent advances are efforts to apply Natural Language
Processing (NLP) techniques to streamlining the labor-intensive processes of biocuration
and systematic scientific reviews.</p>
      <p>
        Biocuration involves the interpretation, representation, and integration of
information relevant to biology into a form that is accessible to both humans and computers.
This process results in databases or knowledgebases (e.g., UniProt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], NCBI Database
Resources [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and the Rat Genome Database (RGD) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) that assimilate the scientific
literature as well as large data sets. Biocuration efforts range in both approach and
scope, but they are increasingly supported by automated tools that facilitate information
triage and tagging [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Similar to biocuration is the systematic review: a literature review that gathers and
analyzes research literature according to a structured methodology and guided by one
or more specific research questions. The aim of a systematic review is to produce an
exhaustive summary of current literature relevant to those research questions.
Sometimes a systematic review is simply an instance of a biocuration effort without sufficient
resources to codify the collected knowledge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As with biocuration, there are
increasing efforts to employ natural language processing and other artificial intelligence
methods to streamline an expert-driven process that is otherwise very labor intensive [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7-10</xref>
        ].
      </p>
      <p>Biocuration and systematic review processes (whether manual or automated) are
complicated by the applicability of overlapping ontologies that cover a breadth of
multispecies knowledge that ranges across biological scales from molecules to populations.
Ultimately, the exploitation of numerous (but well-aligned) ontologies will provide a
comprehensive landscape of biomedical knowledge that will speed the identification of
new hypotheses and avenues of investigation.</p>
      <p>In this paper, we report on our initial efforts to improve ontology alignment within
the context of scientific literature curation. More specifically, we describe an ensemble
learning approach that augments a collection of ontology matching systems with word
embeddings generated from an annotated corpus of relevant scientific literature.</p>
      <p>The rest of this paper is organized as follows: In the next section, we provide
background and discuss related work. In Sections 3 and 4 we describe our experiments,
research hypothesis, and results. Finally, in Section 5, we summarize our conclusions
and plans for future work, including extensions that support learning from
work-centered user interactions.
</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>
        The best-performing ontology matching tools all rely on collections of complementary
matchers in order to compensate for context-specific weaknesses of each
contributing/competing heuristic. The challenge of matcher selection and evidence
combination has been addressed in a variety of ways ranging from ad hoc rules and manual
settings [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to ensemble learning methods [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] that utilize machine learning to
select and weight contributing matchers. Methods, such as “mapping gain”
measurement, are applicable to the related challenge of selecting appropriate background
knowledge sources [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Background knowledge sources play an important role in the performance of
ontology matching tools. While string distance measures and taxonomic structure
comparison form the backbone of most tools for ontology matching, it is also widely recognized
that ontologies constructed by independent experts can differ significantly in both
organization and lexical features. In these situations, researchers commonly seek to
bridge the gap by drawing on various sources of background knowledge, such as: other
ontologies, thesauri, lexical databases, online encyclopedias, and text corpora [
        <xref ref-type="bibr" rid="ref11 ref14">11, 14</xref>
        ].
These knowledge sources can then be used to implement matching functions that
account for spelling variations and synonyms, and that also support some measure of
semantic comparison [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        One approach to measuring semantic similarity of elements is to employ WordNet
similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. However, WordNet offers little coverage of the concepts found in real-world ontologies. Another approach is to learn word embeddings directly from text
corpora. Word embeddings are distributed word representations that are trained
through deep neural networks. Each dimension of the embeddings represents a latent
feature of the word, often capturing useful syntactic and semantic properties [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Word embeddings have proved to be useful at improving the performance of a wide
range of Natural Language Processing (NLP) tasks [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Zhang et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] showed that
word embeddings learned over Wikipedia can improve the effectiveness of matcher
ensembles applied to OAEI benchmark, conference track, and real-world ontologies.
      </p>
      <p>
        Our own work is similar to that of Zhang et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] but is differentiated in two
primary ways. First, we learn word embeddings from a corpus of annotated scientific
literature related to the ontologies to be aligned, rather than from Wikipedia. Second,
we employ ensemble learning to integrate open source ontology matchers with our
word embedding-based matcher.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <p>Our research centers on the hypothesis that the information gleaned from the word
embeddings learned from a relevant, annotated corpus would improve matching results
within a learned ensemble of existing open source ontology matchers. We tested this
hypothesis with systematic experiments using the datasets and techniques described in
the following.
</p>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>
          To evaluate our ensemble matching system, we used two ontologies of disease
vocabularies: the subset of the Online Mendelian Inheritance in Man (OMIM) disease
vocabulary, a flat list of disease terms covering genetic disorders; and the ‘Diseases’ branch
of the National Library of Medicine’s Medical Subject Headings (MeSH). A third
vocabulary, the Comparative Toxicogenomics Database’s (CTD) ‘merged disease
vocabulary’ (MEDIC) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] serves as a reference alignment between OMIM and MeSH. We
chose these datasets primarily because there exists a corpus of PubMed titles and
abstracts where disease mentions are annotated with the corresponding MEDIC
identifiers—such a corpus is needed to train the neural network underlying our word embedding matcher. In particular, PubTator (a Web-based
tool for accelerating manual literature curation) provides an archive of the computer
annotation results for the entire collection of PubMed articles in PubTator1. This
computer-annotated corpus is generated using the DNorm tool for disease named entity
recognition [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>The data files for our ontologies were collected at the end of 2015 for the MeSH,
OMIM, and MEDIC disease vocabularies. The ontology for the MeSH ‘Diseases’
branch includes 11,344 concepts. The ontology of OMIM genetic disorders includes
8,064 concepts. The MEDIC reference alignment identifies 3,435 direct mappings
between MeSH and OMIM concepts. Lastly, the entire PubTator corpus contains
14,412,044 documents.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>Word Embedding Matcher (Word2vec)</title>
        <p>
          Our word embedding matcher uses the similarity scores, as learned by the Word2vec
component of the Deeplearning4j library [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], as the confidence for a match between a
given pair of ontology concepts. Word2vec is a two-layer neural net that processes
text, taking a text corpus as input and outputting a set of feature vectors for words in
the corpus. The vectors used to represent words are called neural word embeddings
and represent a word with numbers based on other neighboring words within the input
corpus (see Table 1). Given a large enough corpus, Word2vec can make highly
accurate guesses about a particular word’s meaning—without human intervention—based
solely on numerical representations of word features, such as the context of individual
words. Word embedding similarity scores are calculated as the cosine similarity of the
vectors for a pair of concepts in the MeSH and OMIM ontologies.
(The rows of Table 1 list example neighboring words for selected concepts: “marrow, (bmt), solid-organ, disseminated, allogeneic, …”; “pressure, rate, hypotension, arterial, concentration, …”; “rate, cardiac, re-infarction, pressure, o2, arterial, …”; “renal, hepatic, failure, acute, function, chronic, …”.)
</p>
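<p>As a concrete illustration, the similarity score for a concept pair is simply the cosine of the angle between their embedding vectors. The following minimal Python sketch (with short, hypothetical vectors standing in for real Word2vec output, which typically has hundreds of dimensions) shows the computation:</p>

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional embeddings for a MeSH and an OMIM concept token.
mesh_vec = [0.12, 0.48, -0.31, 0.22]
omim_vec = [0.10, 0.52, -0.28, 0.19]
confidence = cosine_similarity(mesh_vec, omim_vec)  # used as match confidence
```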
        <p>Before training the Word2vec model, we preprocess the PubTator corpus so that the
annotated phrases for each PubMed document (title and abstract) are replaced by a
unique single-token identifier for the corresponding MeSH or OMIM concept. This is
necessary because Word2vec learns similarity vectors based on individual
words/tokens (and not multi-word phrases). The unique identifier allows us to look up similarity
scores for a given pair of concepts from the trained word embedding model. We used
Deeplearning4j’s suggested configuration: a word window size of 10 for calculating
within-sentence word context and the skip-gram technique, which predicts the surrounding context from a target word and produces more accurate results on large datasets.
1 https://ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#DownloadFTP</p>
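<p>The replacement step can be sketched as below. The character offsets, identifier formats, and example sentence are illustrative assumptions rather than actual PubTator annotations; the point is that each annotated disease mention collapses to a single token for which Word2vec can learn a vector:</p>

```python
def replace_annotations(text, annotations):
    """Replace each annotated phrase with a single-token concept identifier.

    `annotations` is a list of (start, end, concept_id) character spans.
    Replacements are applied right-to-left so earlier offsets stay valid.
    """
    for start, end, concept_id in sorted(annotations, reverse=True):
        token = concept_id.replace(":", "_")  # e.g. "MESH:D001943" -> "MESH_D001943"
        text = text[:start] + token + text[end:]
    return text

# Hypothetical annotated abstract with made-up concept identifiers.
abstract = "We studied breast cancer and Marfan syndrome patients."
spans = [(11, 24, "MESH:D001943"), (29, 44, "OMIM:154700")]
tokenized = replace_annotations(abstract, spans)
# "We studied MESH_D001943 and OMIM_154700 patients."
```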
        <p>Training the Word2vec model for more than 14 million documents is very time
consuming (on the order of weeks). Once the model is built, however, extracting the
similarity score for a given pair of terms is fast. The training time can be reduced by
distributing the processing with, for example, an Apache Spark cluster.
</p>
      </sec>
      <sec id="sec-3-3">
        <title>Ontology Matching Systems</title>
        <p>In addition to the word embedding matcher, we also utilized a number of publicly
available ontology matching systems. These matchers are used both alone and as part of a
learned ensemble to evaluate the relative impact of the addition of our word embedding
matcher. These systems have all participated in past Ontology Alignment Evaluation
Initiative (OAEI) campaigns.</p>
        <p>
          LogMap. LogMap [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is a scalable ontology matching system that utilizes highly
optimized data structures to index the input ontologies (both lexically and structurally) to
compute an initial set of anchor mappings with corresponding confidence values.
These anchors are then used in an iterative process of mapping repair and mapping
discovery to uncover new mappings.
        </p>
        <p>
          AgreementMakerLight (AML). AML is an ontology matching framework based on
AgreementMaker [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], one of the leading ontology matching systems. However,
whereas AgreementMaker is memory-intensive and was not designed to match
ontologies with more than a few thousand concepts, AML is a lightweight system developed
with a focus on computational efficiency and is specialized in the biomedical domain
but applicable to any ontologies.
        </p>
        <p>
          Generic Ontology Matching and Mapping Management (GOMMA). GOMMA
provides a comprehensive and scalable infrastructure to manage large life science
ontologies, but as a generic tool it can be used to match ontologies from other domains
[
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. GOMMA preprocesses all information relevant for matching ontology concepts
(e.g., name, synonyms, comments) and uses maximal string similarity to generate
matches before aggregating the mappings, filtering out any mappings below a certain
threshold, and applying constraints to improve the consistency of mappings.
</p>
<p>
(not) Yet Another Matcher (YAM++). The underlying idea of the YAM++ system is
that the complexity and, therefore, the cost of the ontology matching algorithms can be
reduced by using indexing data structures to avoid exhaustive pair-wise comparisons
[
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. YAM++ preprocesses the input ontologies to calculate the information content of
each word to determine the weights of labels. Candidate mappings are passed to a
process that uses machine learning to combine several different string-based
comparisons to compare the labels/synonyms of entities. Those results are then passed to a
structural matcher, which looks at related entities to find more mappings, before
combining and filtering the results.
</p>
<p>
Falcon-AO. Falcon-AO is a prominent component of the Falcon infrastructure for
Semantic Web applications [
          <xref ref-type="bibr" rid="ref26">26</xref>
]. For our datasets, Falcon-AO primarily uses partition-based block matching (PBM), which first divides each ontology into blocks that have a
high degree of cohesiveness; then, mappings are discovered by matching similar
blocks. The similarity between blocks is a function of the number of “anchors”
(alignments with high similarity based on string comparison techniques) that they share.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Ensemble Learning</title>
        <p>
          We utilize machine learning techniques to determine the weights and confidence level
thresholds for each ensemble configuration, allowing for the systematic learning of
rules for estimating the correctness of a correspondence based on the output of the
different techniques. Our experiments were conducted with the Weka Toolkit [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], using
the Weka implementation of the REPTree classifier, a fast decision tree learner which
builds a tree using information gain as the splitting criterion and then prunes it using
reduced error pruning. Our feature vectors comprise the individual mapping confidence
scores for each technique being evaluated as well as a single meta-level
feature—average matcher confidence. The inclusion of this meta-level feature is based on the
findings of Eckert et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] in which it was found that the most significant feature was not
the confidence scores themselves, but the fraction of matchers that found a
correspondence. All experiments were conducted with the default Weka classifier settings,
making our experiments more easily reproducible.
        </p>
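<p>A simplified sketch of how such a training instance might be assembled is shown below. The matcher list, the 0.0 default for matchers that did not propose a given mapping, and the function shape are our assumptions; the actual instances are built for the Weka REPTree classifier:</p>

```python
MATCHERS = ["LogMap", "AML", "GOMMA", "YAM++", "Falcon-AO", "Word2vec"]

def make_feature_vector(scores):
    """One ensemble training instance: per-matcher confidence scores
    plus the meta-level feature (average matcher confidence)."""
    vec = [scores.get(name, 0.0) for name in MATCHERS]
    vec.append(sum(vec) / len(vec))  # meta-feature appended last
    return vec

# A candidate mapping proposed only by LogMap and the Word2vec matcher.
instance = make_feature_vector({"LogMap": 0.9, "Word2vec": 0.6})
# last element is the average confidence, 1.5 / 6 = 0.25
```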
        <p>Dealing with imbalanced data. Each individual matcher can generate mappings with
a range of confidence scores between 0.0 and 1.0 and, unsurprisingly, a large number
of incorrect mappings appear at low confidence levels. This introduces a problem
during classifier training known as class imbalance—a large difference in the number of
positive and negative instances used to train a classifier (i.e., correct vs. incorrect
mappings), which may result in a classifier that is biased towards this majority class. At
the extreme, this can lead to a classifier with high accuracy that has actually learned to
always choose the majority class (i.e., that the mapping is incorrect). In order to
account for this when training the classifier, we use a common resampling approach in
which the training instances are sampled to provide an even distribution of correct and
incorrect training instances. We achieve this by using the Resample filter of the Weka
framework for sampling without replacement, and biasing towards a uniform class
distribution (i.e., an even split between positive and negative instances).
</p>
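<p>The effect of this resampling can be sketched as follows (a minimal stand-in for Weka's Resample filter: every minority-class instance is kept along with an equal-size sample of the majority class, drawn without replacement):</p>

```python
import random

def resample_balanced(instances, seed=0):
    """Sample without replacement toward a uniform class distribution.

    `instances` is a list of (features, is_correct) pairs.
    """
    rng = random.Random(seed)
    pos = [inst for inst in instances if inst[1]]
    neg = [inst for inst in instances if not inst[1]]
    if len(pos) > len(neg):
        minority, majority = neg, pos
    else:
        minority, majority = pos, neg
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# 3 correct vs. 10 incorrect mappings -> an even 3/3 split for training.
data = [([0.9], True)] * 3 + [([0.1], False)] * 10
train = resample_balanced(data)
```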
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Here we describe the results of our experiments to evaluate the performance of our
Word2vec-based word embedding matcher. We analyze the performance of the word
embedding matcher both in isolation and by measuring its contribution when combined
with one or more existing ontology matching systems, showing that this novel
technique adds value that is not identified by standard ontology matching systems.</p>
      <p>For the evaluation of each particular classifier configuration, we follow a technique
meant to mimic a practical training process for each classifier within the context of
scientific literature curation. More specifically, we limit the training of each classifier
to a small subset of the mappings produced by the corresponding matchers. We split
the training collection into n folds, with each fold consisting of approximately 362
instances, and train a separate classifier on each of the n individual folds. This is meant
to simulate the process of training the classifier with a small number of manually
reviewed mappings. The approximate fold size of 362 was chosen so that
the smallest training collection (YAM++ by itself; 3,628 mappings) would have 10
folds for training. Every evaluation uses the same test collection, consisting of the
union of all of the potential mappings generated by each of the matching systems
(including Word2vec). This allows for a more accurate comparison of the evaluation
results across different classifier configurations. We report the average and standard
deviation of the traditional precision, recall, and F-measure metrics across each of the n
folds for each classifier configuration.
</p>
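<p>For reference, these per-fold metrics reduce to set operations over the reference alignment and the predicted mappings. A standard sketch, with mappings represented as hypothetical (source, target) identifier pairs:</p>

```python
def precision_recall_f(reference, predicted):
    """Precision, recall, and F-measure of a predicted alignment
    against a reference alignment (sets of (source, target) pairs)."""
    reference, predicted = set(reference), set(predicted)
    tp = len(reference & predicted)  # correctly predicted mappings
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# One of the two predicted mappings is in the reference alignment.
p, r, f = precision_recall_f(
    {("MeSH:A", "OMIM:1"), ("MeSH:B", "OMIM:2")},
    {("MeSH:A", "OMIM:1"), ("MeSH:C", "OMIM:3")},
)
# p = 0.5, r = 0.5, f = 0.5
```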
      <sec id="sec-4-1">
        <title>Word Embedding Similarity Scores</title>
        <p>We first analyzed the similarity scores produced by the Word2vec technique, which are
the cosine similarity of the vectors for each pair of concepts in the MeSH and OMIM
ontologies. For comparison, we built two word embedding models for the PubTator
corpus: one with the standard configuration and one providing a list of stop words,
which Word2vec ignores during training. The chart in Fig. 1 shows the raw counts of
the correct and incorrect mappings for both of these models.
Fig. 1. The raw number of correct and incorrect mappings by Word2vec similarity score for
two word embedding models, trained with and without stop words ignored.</p>
        <p>The results from both models are very similar, with the global distribution of
similarity scores (both correct and incorrect) following a normal distribution. The
Word2vec model that ignores stop words finds slightly more correct mappings at lower values of the similarity score threshold (i.e., below 0.9). It is understandable
that ignoring stop words makes little difference if the window size is sufficient, since
the Word2vec model automatically accounts for the information gain afforded by
specific context words (which should be near zero for stop words). In both models, the
number of incorrect mappings increases drastically as the similarity score threshold
decreases, with the numbers of correct and incorrect mappings becoming roughly equal at a similarity score threshold of 0.85.</p>
        <p>For our experiments, we use similarity scores of at least 0.69. This threshold was
chosen so that the number of mappings would be at least twice the size of the larger of
the two ontologies (the MeSH ontology contains 11,344 concepts) because a concept
in the MeSH ontology may map to more than one concept in the smaller OMIM
ontology (8,064 concepts), but not the other way around. By comparison, the number of
potential mappings generated by the other ontology matching systems ranges from
3,628 to 7,145. Classifiers trained from the Word2vec similarity scores alone do not
perform particularly well (Table 2). Surprisingly, precision was high and recall was low, the reverse of what we had expected. For our remaining reported
experimental results, we use the model with stop words ignored, representing 25,610 total
instances (5.6% of which are correct mappings).</p>
<p>For our baseline, we first look at each ontology matching system alone, using our ensemble approach to learn how to distinguish correct from incorrect mappings using only the confidence scores produced by each system (Table 3).</p>
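<p>The threshold selection described above can be sketched as a simple scan. This is an illustrative reconstruction: the 0.01 step size and the scanning procedure are our assumptions, and only the stopping rule (keeping at least twice as many candidates as the larger ontology has concepts) comes from the text:</p>

```python
def choose_threshold(scored_pairs, min_mappings, step=0.01):
    """Lower the similarity threshold until at least `min_mappings`
    candidate mappings remain.

    `scored_pairs` maps (mesh_id, omim_id) -> cosine similarity score.
    """
    threshold = 1.0
    while threshold > 0.0:
        kept = sum(1 for s in scored_pairs.values() if s >= threshold)
        if kept >= min_mappings:
            return threshold
        threshold = round(threshold - step, 2)
    return 0.0

# Toy scores: with min_mappings = 2 the scan stops at threshold 0.8.
scores = {("m1", "o1"): 0.9, ("m2", "o2"): 0.8, ("m3", "o3"): 0.7}
threshold = choose_threshold(scores, 2)
```

<p>For the datasets above, min_mappings would correspond to twice the 11,344 MeSH concepts (22,688 candidates), matching the reported 0.69 cutoff.</p>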
        <p>The scores for each individual matching system vary widely, which is not
particularly surprising given the relatively small fixed-size folds that are used for training each
classifier. In the individual configuration, GOMMA and Falcon-AO perform the best
on these datasets, with F-measures of 0.590 and 0.546, respectively. Having identified
the baseline values for each ontology matching system, we then included the similarity
scores generated from our Word2vec word embedding matcher when training a new
ensemble for each of the individual ontology matching systems (Table 3).</p>
        <p>When including the Word2vec similarity scores, we see improved F-measure scores
across the board and, in general, the standard deviation for each statistic decreases. The
most significant gains are to the recall of the LogMap and AML systems as well as in
the precision of LogMap and YAM++. Interestingly, the recall for YAM++ drops when
adding Word2vec similarity scores.</p>
        <p>Finally, we combined all of the ontology matching systems together to compare the
results both with and without Word2vec, as shown in Table 4. The F-measure for the
model trained using the results from all of the ontology matching systems (without
Word2vec) improves over the classifiers trained on the results of each system alone
(even if the improvement is only marginal, as in the case of GOMMA). The only
evaluation statistics to decrease in the full ensemble configuration are the recall for
GOMMA and for YAM++.</p>
        <p>Word2Vec contributes value beyond the traditional matching systems: including the
Word2vec similarity scores when training the ensemble model boosts recall, precision,
and F-measure (the standard deviation across each training fold also increases).</p>
        <p>Interestingly, when comparing the performance of the full ensemble classifier (with
Word2vec) against the individual matchers each paired with Word2vec, we see that the
F-measure for both AML and GOMMA does not change significantly when including
the other systems. This would seem to indicate that neither GOMMA nor AML, when
combined with Word2vec, are further improved by adding any of the additional
matching systems. However, note that GOMMA produces the highest recall of any
combination evaluated (0.846 ±0.239), whereas the full ensemble and AML (each including
Word2vec) appear to be more balanced as illustrated by their lower recall and higher
precision scores.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we have described an ensemble learning approach that augments a
collection of ontology matching systems with word embeddings generated from an
annotated corpus of relevant scientific literature. We have shown that, within this ensemble
approach to ontology matching, the information within word embeddings does
contribute to learning an improved model for identifying correct alignments between two
ontologies, beyond what state-of-the-art ontology matching systems identify—both
individually and in combination. More specifically, the best overall performance (by
F-measure) was found in the combination of word embedding-derived similarity scores
with either the full ensemble containing all of the matching systems under evaluation
or the individual AML and GOMMA matching systems. However, each of those
configurations differed in precision and recall and, therefore, the needs of any particular
use case will inform the best configuration for each individual situation.</p>
      <p>
There are also several questions that remain to be answered by future work as well as by
our own ongoing research. First, we are currently analyzing the PubTator corpus to
extract a list of multi-word expressions—using a novel technique for extracting salient
variable-length phrases from large text corpora [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]—which we will use in a similar
approach to preprocess the corpus and, prior to training the word embedding model,
remove all text that is not among the top expressions in the corpus. We also see
opportunities to improve upon our ensemble learning approach by providing additional
meta-level features when training our ensemble model, such as binary matcher voting, global
ontology features, and concept-specific lexical features used by Eckert et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>Repeating our experiments with different ontologies and/or in a different domain
would help to corroborate our results. Training the relevant Word2vec model, however,
requires identifying a sufficiently large domain-relevant corpus that is also annotated
with concepts from those ontologies. Given a domain-relevant corpus, it may be
possible to use an automated system to automatically detect and annotate concept labels in
text, as was done by the DNorm disease tagger for the PubTator corpus.</p>
      <p>There is also an opportunity to significantly reduce the processing time needed to
train a Word2vec model from a given corpus. We briefly explored using
Deeplearning4j’s support for the Apache Spark cluster-computing framework, but we were
unable to fully implement the functionality due to time limitations. With Spark,
Deeplearning4j can distribute the processing and train models in parallel for individual shards of
the large corpus before iteratively averaging the parameters into a central model.</p>
      <p>Lastly, in specific regard to manual biocuration and systematic review processes, we
see an opportunity to exploit additional sources of evidence beyond the resulting
annotated corpus. More specifically, it may be possible to collect incremental pieces of
feedback from work-centered interfaces over the course of a user’s normal interaction
during biocuration and annotation tasks—for example, while searching for or
disambiguating specific concepts for annotating a particular text mention or reference—that
can be utilized to further improve ontology matching processes.</p>
      <sec id="sec-5-1">
        <title>Acknowledgments</title>
        <p>This work is supported by the US Army Medical Research and Materiel Command
under Contract No. W81XWH-13-C-0036.</p>
        <p>The views, opinions and/or findings contained in this report are those of the author(s)
and should not be construed as an official Department of the Army position, policy or
decision unless so designated by other documentation.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
<collab>The UniProt Consortium</collab>
          (
          <year>2015</year>
          )
          <article-title>UniProt: a hub for protein information</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>43</volume>
          :
          <fpage>D204</fpage>
          -
          <lpage>D212</lpage>
          . doi: 10.1093/nar/gku989
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sayers</surname>
            <given-names>EW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrett</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benson</surname>
            <given-names>DA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolton</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bryant</surname>
            <given-names>SH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canese</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feolo</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>Database resources of the national center for biotechnology information</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>40</volume>
          (
          <issue>D1</issue>
          ):
          <fpage>D13</fpage>
          -
          <lpage>D25</lpage>
          . doi: 10.1093/nar/gks1189
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Shimoyama</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Pons</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayman</surname>
            <given-names>GT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laulederkind</surname>
            <given-names>SJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petri</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            <given-names>JR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tutaj</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>SJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worthey</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwinell</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacob</surname>
            <given-names>H</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>The rat genome database 2015: genomic, phenotypic and environmental variations and disease</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>43</volume>
          :
          <fpage>D743</fpage>
          -
          <lpage>50</lpage>
          . doi: 10.1093/nar/gku1026
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ghiasvand</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoyama</surname>
            <given-names>M</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Introducing a text annotation tool (OnToMate); assisting curation at rat genome database</article-title>
          .
          <source>In: Proceedings of the 7th ACM international conference on bioinformatics, computational biology, and health informatics (BCB '16)</source>
          . ACM, New York, pp
          <fpage>465</fpage>
          -
          <lpage>465</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Poux</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arighi</surname>
            <given-names>CN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magrane</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            <given-names>CH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boutet</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bye-A-Jee</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Famiglietti</surname>
            <given-names>ML</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roechert</surname>
            <given-names>B</given-names>
          </string-name>
          (
          <year>2017</year>
          )
          <article-title>On expert curation and scalability: UniProtKB/Swiss-Prot as a case study</article-title>
          .
          <source>Bioinformatics btx439</source>
          . doi: https://doi.org/10.1093/bioinformatics/btx439
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rodriguez-Esteban</surname>
            <given-names>R</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Biocuration with insufficient resources and fixed timelines</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          <year>2015</year>
          ;
          <year>2015</year>
          : bav116. doi: 10.1093/database/bav116
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marshall</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brereton</surname>
            <given-names>P</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Systematic review toolbox: A catalogue of tools to support systematic reviews</article-title>
          .
          <source>In: Proceedings of the 19th international conference on evaluation and assessment in software engineering. ACM</source>
          , New York, p
          <fpage>23</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Choong</surname>
            <given-names>MK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galgani</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunn</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsafnat</surname>
            <given-names>G</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Automatic evidence retrieval for systematic reviews</article-title>
          .
          <source>J Med Internet Res</source>
          <year>2014</year>
          ;
          <volume>16</volume>
          (
          <issue>10</issue>
          ): e223. doi: 10.2196/jmir.3369
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wallace</surname>
            <given-names>BC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuiper</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            <given-names>MB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            <given-names>IJ</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Extracting PICO sentences from clinical trial reports using supervised distant supervision</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>17</volume>
          (
          <issue>132</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Basu</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyan</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jayaswal</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettifer</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonnalagadda</surname>
            <given-names>S</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Systematic reviews by automatically building information extraction training corpora</article-title>
          .
          <source>arXiv preprint arXiv:1606.06424</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shvaiko</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            <given-names>J</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Ontology matching: state of the art and future challenges</article-title>
          .
          <source>IEEE transactions on knowledge and data engineering</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>158</fpage>
          -
          <lpage>176</lpage>
          . doi: 10.1109/TKDE.2011.253
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Eckert</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meilicke</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuckenschmidt</surname>
            <given-names>H</given-names>
          </string-name>
          (
          <year>2009</year>
          )
          <article-title>Improving ontology matching using metalevel learning</article-title>
          . In: Aroyo L, et al. (eds).
          <source>LNCS</source>
          , volume
          <volume>5554</volume>
          . Springer International Publishing, Cham, Switzerland, pp
          <fpage>158</fpage>
          -
          <lpage>172</lpage>
          . doi: https://doi.org/10.1007/978-3-642-02121-3
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gal</surname>
            <given-names>A</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>Uncertain schema matching</article-title>
          .
          <source>Synthesis Lectures on Data Management</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>97</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Faria</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pesquita</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            <given-names>IF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couto</surname>
            <given-names>FM</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Automatic background knowledge selection for matching biomedical ontologies</article-title>
          .
          <source>PLoS ONE</source>
          <volume>9</volume>
          (
          <issue>11</issue>
          ): e111226. doi: https://doi.org/10.1371/journal.pone.0111226
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            <given-names>X</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Ontology matching with word embeddings</article-title>
          . In: Sun M,
          <string-name>
            <surname>Liu</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>J</given-names>
          </string-name>
          (eds)
          <article-title>Chinese computational linguistics and natural language processing based on naturally annotated big data</article-title>
          . Springer International Publishing, Cham, Switzerland, pp
          <fpage>34</fpage>
          -
          <lpage>45</lpage>
          . doi: https://doi.org/10.1007/978-3-319-12277-9
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lin</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sandkuhl</surname>
            <given-names>K</given-names>
          </string-name>
          (
          <year>2008</year>
          )
          <article-title>A survey of exploiting wordnet in ontology matching</article-title>
          .
          <source>In: Bramer M (ed) Artificial intelligence in theory and practice</source>
          , vol
          <volume>2</volume>
          . Springer US, New York, pp
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          . doi: 10.1007/978-0-387-34747-9
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Turian</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratinov</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            <given-names>Y</given-names>
          </string-name>
          (
          <year>2010</year>
          )
          <article-title>Word representations: a simple and general method for semi-supervised learning</article-title>
          . In:
          <article-title>Proceedings of the 48th annual meeting of the association for computational linguistics</article-title>
          .
          <source>Association for Computational Linguistics</source>
          , Stroudsburg, PA, pp
          <fpage>384</fpage>
          -
          <lpage>394</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Li</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>T</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Word embedding for understanding natural language: survey</article-title>
          . In: Srinivasan S. (ed)
          <article-title>Guide to big data applications. Studies in big data</article-title>
          , vol
          <volume>26</volume>
          . Springer International Publishing, Cham, Switzerland, pp
          <fpage>83</fpage>
          -
          <lpage>104</lpage>
          . doi: 10.1007/978-3-319-53817-4
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Davis</surname>
            <given-names>AP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegers</surname>
            <given-names>TC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenstein</surname>
            <given-names>MC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattingly</surname>
            <given-names>CJ</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>MEDIC: a practical disease vocabulary used at the comparative toxicogenomics database</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          <year>2012</year>
          ;
          <year>2012</year>
          : bar065. doi: 10.1093/database/bar065
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Leaman</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islamaj Doğan</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            <given-names>Z</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>DNorm: disease name normalization with pairwise learning to rank</article-title>
          .
          <source>Bioinformatics</source>
          <volume>29</volume>
          (22):
          <fpage>2909</fpage>
          -
          <lpage>2917</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Deeplearning4j Development Team</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Deeplearning4j: Open-source distributed deep learning for the JVM</article-title>
          . https://deeplearning4j.org/about.
          Accessed 27 July 2017
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca Grau</surname>
            <given-names>B</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>LogMap: Logic-based and scalable ontology matching</article-title>
          . In: Aroyo L et al. (eds)
          <source>The semantic web - ISWC 2011. Lecture Notes in Computer Science</source>
          , vol
          <volume>7031</volume>
          . Springer, Berlin, Heidelberg, pp
          <fpage>273</fpage>
          -
          <lpage>288</lpage>
          . doi: https://doi.org/10.1007/978-3-642-25073-6_18
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Faria</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pesquita</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            <given-names>IF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couto</surname>
            <given-names>FM</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>The AgreementMakerLight ontology matching system</article-title>
          . In: Meersman R et al. (eds)
          <source>On the move to meaningful internet systems: OTM 2013 Conferences. Lecture Notes in Computer Science</source>
          , vol
          <volume>8185</volume>
          . Springer, Berlin, Heidelberg, pp.
          <fpage>527</fpage>
          -
          <lpage>541</lpage>
          . doi: https://doi.org/10.1007/978-3-642-41030-7_38
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Kirsten</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartung</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahm</surname>
            <given-names>E</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>GOMMA: a component-based infrastructure for managing and analyzing life science ontologies and their evolution</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ): 6. doi: 10.1186/2041-1480-2-6
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ngo</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            <given-names>Z</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Overview of YAM++-(not) Yet Another Matcher for ontology alignment task</article-title>
          .
          <source>Dissertation</source>
          , LIRMM
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Hu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            <given-names>Y</given-names>
          </string-name>
          (
          <year>2008</year>
          )
          <article-title>Falcon-AO: A practical ontology matching system</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>6</volume>
          (
          <issue>3</issue>
          ):
          <fpage>237</fpage>
          -
          <lpage>239</lpage>
          . doi: 10.1016/j.websem.2008.02.006
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Hall</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            <given-names>I</given-names>
          </string-name>
          (
          <year>2009</year>
          )
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          . doi: 10.1145/1656274.1656278
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Shang</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voss</surname>
            <given-names>CR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            <given-names>J</given-names>
          </string-name>
          (
          <year>2017</year>
          )
          <article-title>Automated phrase mining from massive text corpora</article-title>
          .
          <source>arXiv preprint arXiv:1702.04457</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>