Leveraging Wikipedia for Ontology Pattern Population from Text

Michelle Cheatham and James Lambert
Wright State University
3640 Colonel Glenn Hwy.
Dayton, OH 45435

Charles Vardeman II
University of Notre Dame
111C Information Technology Center
Notre Dame, IN 46556

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Traditional approaches to populating ontology design patterns from unstructured text often involve using a dictionary, rules, or machine learning approaches that have been established based on a training set of annotated documents. While these approaches are quite effective in many cases, performance can suffer over time as the nature of the text documents changes to reflect advances in the domain of interest. This is particularly true when attempting to populate patterns related to fast-changing domains such as technology, medicine, or law. This paper explores the use of Wikipedia as a source of continually updated background knowledge to facilitate ontology pattern population as the domain changes over time.

Introduction

Two of the central underpinnings of scientific inquiry are the need to verify results through reproduction of the experiments involved and the importance of "building on the shoulders of giants." In order for these things to be possible, experimental results need to be both discoverable and reproducible. Important steps have been made recently in pursuit of this, including the relaxation of page restrictions on "methodology" sections in many academic journals and the requirements by some funding agencies that investigators make any data they collect publicly available. However, in order for previous work to be truly verifiable and reusable, researchers must be able not only to access the results of those efforts but also to understand the context in which they were created. A key element of this is the need to preserve the underlying computations and analytical process that led to prior results in a generic machine-readable format.

In previous work towards this goal, we developed an ontology design pattern (ODP) to represent the computational environment in which an analysis was performed. This model is briefly described in the Computational Environment Representation section of this work, and more detail is available in (Cheatham et al. 2017). This paper describes our work on the next step: development of an automated approach to populate the ODP based on data extracted from academic articles. We explore the performance of two common approaches to this task and show that, due to the fast-changing nature of computer technology, they lose their effectiveness over time. We then evaluate the utility of using a continuously manually-curated knowledge base to mitigate this performance degradation. The results illustrate that this method holds some promise.

The remainder of this paper is organized as follows. The section Computational Environment Representation presents the schema of the ontology we seek to populate, while the Dataset section describes the collection of academic articles we use as our training and test sets. The approach and results are presented and analyzed next, and finally some conclusions and ideas for future work in this area are discussed.

Computational Environment Representation

The Computational Environment ODP was developed over the course of several working sessions by a group of ontological modeling experts, library scientists, and domain scientists from different fields, including computational chemists and high-energy physicists interested in preserving analysis of data collected from the Large Hadron Collider at CERN. Our goal was to arrive at an ontology design pattern that is capable of answering the following competency questions:

• What environment do I need to put in place in order to replicate the work in Paper X?

• There has been an error found in Script Y. Which analyses need to be re-run?
• Based on recent research in Field Z, what tools and resources should new students work to become familiar with?

• Are the results from Study A and Study B comparable from a computational environment perspective?

We focused on creating a model to capture the actual environment present during a computational analysis. Representing all possible environments in which it is feasible for the analysis to be executed is outside of the scope of our current effort. We also do not include the runtime configuration and parameters as part of the environment. The rationale is that to some extent the same environment should be applicable to many computational analyses in the same field of study, but this would not be true if we included such analysis-specific information as runtime parameters as part of the environment. Data sources were considered outside of the confines of the computational environment for similar reasons. External web services and similar resources were not included because they are not inter-related in the same way as the environmental elements are. For instance, deciding to use a different operating system often necessitates using a different version of drivers, libraries, and software applications, whereas in most cases the entire hardware configuration could be changed with no impact on external services.

Figure 1 shows the schema that we targeted for population in this study. It is a slightly modified version of the ODP presented in (Cheatham et al. 2017) – the overall goal and competencies of the pattern remain the same, but some entities have been omitted, and properties related to the manufacturer, make, and model of computers and hardware components have been added, as well as an entity related to programming language, because this information is important for our current application goals.

Figure 1: The Computational Environment ODP
Dataset

The dataset consists of 100 academic articles published in 2012 or later (called "current") and 20 published in or before 2002 (called "old"). (Three articles from the "current" set had to be thrown out at a later stage due to irregularities that caused them to be unparseable by multiple PDF parsers.) There are 20 current papers and four old ones from each of five fields of study: biology, chemistry, engineering, mathematics, and physics. Twenty of the current articles were randomly selected to make up the current training set while the remainder form the test set. The documents were collected by searching Google Scholar using the query "<field> AND algorithm" (e.g. biology AND algorithm, chemistry AND algorithm). Patents and citations were omitted from the search criteria. Any paper with a PDF download link was searched for the terms cpu and processor, in order to quickly determine if the paper had any content related to a computational environment. If the paper contained either term, it was retained in the dataset.

We then manually created a gold standard for the dataset by going through each paper and listing any information relevant to the ODP shown in Figure 1. This process was completed by a single person, but in the few cases in which a question arose (e.g. whether an R4000 is a make or a model), the opinions of others and external knowledge sources such as manufacturers' websites and websites about historical computing technologies were consulted. A typical entry in the gold standard is shown in Listing 1.

Listing 1: Sample entry from the gold standard

bio-15
cpu_make: HT Xeon
cpu_frequency_value: 3.2
cpu_frequency_unit: GHz
cpu_num-cores: 16
memory_type: RAM
memory_size_value: 8
memory_size_unit: GB
programming-language: C/C++
os_kernel_name: Linux
os_distribution_name: Red Hat

Note that the entries given do not directly correspond to entities in the ontology. For example, the single tag memory_size_unit: GB corresponds to an instance of the Memory class in the ontology with a hasSize of some Amount that in turn hasUnit GB. Tagging based on each entity within the ontology, including those such as Memory that represent blank nodes, would have made the tagging process more arduous, however, and the information can be readily expanded to match the desired schema using SPARQL construct queries, as described in (Zhou et al. 2018).
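To give a concrete sense of this expansion step, the sketch below rewrites the flat memory tags from Listing 1 into the Memory/Amount structure just described, using a SPARQL CONSTRUCT query executed with rdflib. The namespaces and the hasPart and hasValue property names are illustrative assumptions; only hasSize and hasUnit come from the pattern, and the actual queries from (Zhou et al. 2018) may differ.

from rdflib import Graph

# Flat gold-standard tags for one article, expressed as RDF. The tag:
# namespace is an assumption made for this example.
flat_tags = """
@prefix tag: <http://example.org/tags#> .

tag:bio-15 tag:memory_size_value "8" ;
           tag:memory_size_unit  "GB" .
"""

# Expand the flat tags into the pattern's Memory -> hasSize -> Amount ->
# hasUnit structure. hasSize and hasUnit come from the ODP as described
# above; the env: namespace, hasPart, and hasValue are illustrative.
expand = """
PREFIX tag: <http://example.org/tags#>
PREFIX env: <http://example.org/compenv#>
CONSTRUCT {
    ?article env:hasPart [ a env:Memory ;
        env:hasSize [ a env:Amount ;
                      env:hasValue ?value ;
                      env:hasUnit  ?unit ] ] .
}
WHERE {
    ?article tag:memory_size_value ?value ;
             tag:memory_size_unit  ?unit .
}
"""

g = Graph()
g.parse(data=flat_tags, format="turtle")
for s, p, o in g.query(expand):
    print(s, p, o)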
A preliminary analysis of the training and test datasets indicates that approaches trained on the older articles may face difficulties. For example, Figure 2 shows the number of distinct values for each tag in the test set (left), current training set (middle) and old training set (right). From this we can see that the relative number of distinct values for each tag is about the same in the test set and the current training set (i.e. the tags with the largest number of distinct values in the test set are also those with the largest number of distinct values in the training set). This is less true for the old training set – for instance, the older articles never talk about the number of CPU cores (presumably because they were all single core) and the only programming language they mention is Fortran. It therefore may be difficult for models trained on the old training set to correctly extract information relevant to these tags in the test set. New entities becoming relevant over time are sometimes termed "emergent entities" and are particularly challenging for many NER systems (Nakashole, Tylenda, and Weikum 2013).

Figure 2: The number of distinct values for each tag in the test set and the current and old training sets.

Figure 3 shows the number of times each tag was used in the test set and both training sets. It is evident that the older articles talk more at the level of computer manufacturer, make and model, while more current articles mention more details such as information about the CPU, the graphics card, and the amount of memory. Looking at both figures (2 and 3), we see that some tags, such as CPU manufacturer, are key for this ontology population task, because there are only a few distinct values that must be recognized, but these few values occur many times within the test set. Models trained on the old training set may be at a particular disadvantage in these cases, if the values for those key tags in the older documents are not reflective of those in the test set.

Figure 3: The number of occurrences of each tag in the test set and the current and old training sets.

Approach and Results

In this work we formulate the problem of populating the computational environment pattern from text solely as a Named Entity Recognition (NER) task. This is in contrast to many other approaches that divide this process into two steps: entity recognition/typing and relation extraction (Petasis et al. 2011). In other words, we are trying to directly arrive at the tags specified in the gold standard as described in the Dataset section, with the intention of later creating SPARQL construct queries to expand these tags into the terminology used by the ODP. If more than one computational environment is described in an article, the appropriate relations between instances can then be determined based on proximity in the underlying text.

Preprocessing

In academic articles, the computational environment tends to be discussed in a small number of isolated points within a document, often a single paragraph, sentence, footnote, or caption. Because some of the techniques we employ in our approach, particularly the use of Wikipedia, are time-intensive, we begin our ontology population task by attempting to identify the key portions of each document (which we generically refer to as the "key paragraph", with the understanding that it may actually be a footnote, caption or other element). Our goal in doing this was to arrive at an ontology-agnostic approach with high recall. The approach we take is quite basic: a Python script is given the ontology (in the form of an OWL file) and the name of the academic article in which to identify the key paragraphs. The script parses the names of all entities from the ontology, splits the entire content of the academic article (including footnotes, captions, etc.) into paragraphs, and counts the number of times any ontology term appears in each paragraph. Any paragraph with a count within 90 percent of the maximum count for that article is considered a key paragraph. This approach produces a recall of .94 and a precision of .99, for an F-measure of .96.
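A minimal sketch of this scoring heuristic is shown below, assuming the ontology entity names and the article's paragraphs have already been extracted as strings (the OWL and PDF parsing performed by the actual script is omitted here):

# Count ontology term occurrences per paragraph and keep every paragraph
# scoring within 90 percent of the article's maximum, as described above.
def find_key_paragraphs(paragraphs, ontology_terms, threshold=0.9):
    def score(paragraph):
        text = paragraph.lower()
        return sum(text.count(term.lower()) for term in ontology_terms)

    scores = [score(p) for p in paragraphs]
    best = max(scores, default=0)
    if best == 0:
        return []  # no ontology term appears anywhere in the article
    return [p for p, s in zip(paragraphs, scores) if s >= threshold * best]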
The remainder of the discussion in this section assumes that the key paragraphs have been successfully identified prior to invoking the approach under consideration. The text in each key paragraph is split into sentences, tokenized, and lemmatized, and part-of-speech tagging is performed using the Stanford NLP pipeline (Manning et al. 2014).

Machine Learning

NER techniques generally fall into two categories: machine learning-based and rules-based (Nadeau and Sekine 2007). Machine learning-based NER systems typically involve training a classifier by manually tagging input documents. The classifier then attempts to learn how to correctly recognize and type entities from new documents based on the training set and a set of features such as a word's part of speech, position in a document, prefix/suffix, frequency, etc. Various approaches have been employed to model the relationship between the features and the entities, including Support Vector Machines (Isozaki and Kazawa 2002), Maximum Entropy (Chieu and Ng 2002), and neural networks (Lample et al. 2016). Because the task of manually generating training data is onerous, there are also semi-supervised and unsupervised machine learning-based NER approaches, but these often only rival, rather than exceed, the performance of supervised systems (Nadeau and Sekine 2007) and so are not considered here.

In this work we applied a Conditional Random Field (CRF) based classifier to the task of tagging information relevant to the computational environment ontology. CRF was chosen due to its long-standing popularity for information extraction from unstructured text (Kristjansson et al. 2004; Bundschus et al. 2008). We again used the Stanford NLP group's implementation (Finkel, Grenager, and Manning 2005). This classifier uses features such as word order, n-grams, part of speech, and word shape to create a probabilistic model to predict the tags in previously unseen documents. We developed models based on both the current and old training sets; an example of the tagged data used for training is shown in Listing 2 (the O symbol indicates that no tag is relevant for that token).

Listing 2: Tagged data used to train the CRF classifier

running O
on O
a O
PC O
with O
Linux os_kernel_name
CentOS os_distribution_name
5 os_distribution_version
, O
1 memory_size_value
GB memory_size_unit
memory O
and O
2.4 cpu_frequency_value
GHz cpu_frequency_unit
CPU O

Table 1: Performance of a machine learning-based approach

Dataset   Model    Precision  Recall  F-measure
Training  Current  0.95       0.93    0.94
Training  Old      0.95       0.93    0.94
Test      Current  0.90       0.52    0.66
Test      Old      1.00       0.01    0.02

We then used each model to tag the articles in the test set and assessed the performance in terms of precision, recall, and F-measure (Table 1). The F-measure of the approach on the training data is not quite 1.0 because tagging in the Stanford NLP pipeline only happens at the level of tokens, and some of the articles contain malformed tokens, such as IntelCorei7, that contain information about more than one tag. Still, the performance of both models is quite good on the training data. The F-measure using the model trained on current documents drops considerably on the test set, but 0.66 may be good enough to be of some use in many applications (others have found that the performance of NER in technical domains such as biomedicine is in the .58–.75 range (Zhang and Ciravegna 2011)). Conversely, the performance using the model trained on older articles is abysmal.
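As a concrete illustration of this feature-based setup, the sketch below trains a CRF on a Listing 2-style token/tag sequence. It uses the third-party sklearn-crfsuite package rather than the Stanford implementation employed in our experiments, and its feature set is a deliberately small subset of the features described above.

import sklearn_crfsuite

# A minimal per-token feature extractor: word shape, suffix, and immediate
# context. The Stanford classifier uses a much richer feature set.
def token_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_digit": word.isdigit(),
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in word),
        "suffix3": word[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One training sequence corresponding to part of Listing 2.
tokens = ["with", "Linux", "CentOS", "5", ",", "1", "GB", "memory"]
tags = ["O", "os_kernel_name", "os_distribution_name",
        "os_distribution_version", "O", "memory_size_value",
        "memory_size_unit", "O"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))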
Rules

Rules-based NER systems allow users to craft rules using a regular expression language that can incorporate text, part of speech, and previously assigned tags. An example rule, which states that any noun phrase that is between an operating system kernel name and the word "version" should be tagged as an operating system distribution name, is shown in Listing 3.

Listing 3: Sample rule

{
  ruleType: "tokens",
  pattern: ( [ { ner: os_kernel_name } ]
             ( [ ( { pos:NN } | { pos:NNP } ) &
                 !{ word:/(?i)version/ } ] ) ),
  action: ( Annotate( $1, ner, "os_distribution_name" ) ),
  stage: 2
}
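To spell out what this rule does, the sketch below applies the same logic directly over (word, POS tag, NER tag) triples. It is a simplified stand-in written for illustration, not the TokensRegex engine itself.

# Simplified re-implementation of the Listing 3 rule: a noun immediately
# following a token already tagged os_kernel_name, other than the word
# "version", is re-tagged as os_distribution_name.
def apply_distribution_rule(tokens):
    tagged = list(tokens)
    for i in range(len(tagged) - 1):
        word, pos, _ = tagged[i + 1]
        if (tagged[i][2] == "os_kernel_name"
                and pos in ("NN", "NNP")
                and word.lower() != "version"):
            tagged[i + 1] = (word, pos, "os_distribution_name")
    return tagged

tokens = [("Linux", "NNP", "os_kernel_name"), ("CentOS", "NNP", "O"),
          ("version", "NN", "O"), ("5", "CD", "O")]
print(apply_distribution_rule(tokens))  # CentOS gets os_distribution_name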
Since the results of the machine learning-based approach were mediocre, we also implemented a rules-based approach using the Stanford NLP group's TokensRegex system (Chang and Manning 2014). A strong effort was made to establish a set of rules for each training set that was as general as possible while still maximizing F-measure. Of course, the more general the rules are, the more likely they are to conflict with one another at some point. Developing these rule sets was therefore quite a difficult task, with each one taking about two days to create. These rule sets were then used to tag the articles from the test set. The performance is shown in Table 2.

Table 2: Performance of a rules-based approach

Dataset   Model    Precision  Recall  F-measure
Training  Current  0.96       0.93    0.95
Training  Old      0.99       0.94    0.97
Test      Current  0.79       0.67    0.73
Test      Old      0.80       0.28    0.41

We again see that the performance of the rule set based on current articles drops on the test set, though it remains significantly higher than that of the machine learning-based approach (.73 versus .66 F-measure). Additionally, the performance of the rule set based on older articles, while only a little more than half that of the current rules, is much more reasonable than that of the CRF classifier using the model trained on old articles (.41 versus .02 F-measure). We tried combining the machine learning- and rules-based approaches, but the combined performance was not any better than that of the rules-based approach alone. We therefore did not consider the machine learning-based approach any further for this ontology population task.

Enhancing Rules with Background Knowledge

As shown in the previous section, the performance of named entity recognition in this domain degrades considerably over time. In this case, the rules created from articles at least ten years older than those being tagged were 44 percent less effective than rules based on contemporaneous articles (.41 versus .73 F-measure). In looking into the specifics of this issue, one underlying problem is that some tags' rules are based on other tags. For example, a computer's make can often be recognized because it follows a computer's manufacturer, as in Lenovo ThinkPad. If Lenovo is not recognized as a computer manufacturer there is a double hit to performance, because not only is that word not tagged correctly, but neither is ThinkPad. The common computer manufacturers of a decade ago (e.g. Silicon Graphics, Compaq, DEC) are not common today, which leads to the observed performance degradation. Even more problematic is that over time completely new technologies become relevant to descriptions of computational environments. A good example is GPUs, which have only become popular for parallel processing relatively recently.

Our goal in this work was to reduce the performance degradation seen as rules age by leveraging a source of background knowledge that is continuously updated. In addition, we sought to develop an approach that does not require the types of time-consuming training typical of machine learning and rules-based methods. Ideally, the approach should take the ontology (or desired set of tags) as input and require no additional configuration.

The solution we developed uses Wikipedia as the background knowledge source. We developed a custom NER module that fits into the Stanford NLP pipeline. The module takes the list of desired tags as input. We limit this list to the tags that are not related to numeric values (i.e. we omit tags like num-processors and cpu_num-cores) because querying Wikipedia for a number like "4" or "8" is not going to produce any useful information. We also provide common synonyms for tags (e.g. "processor" for CPU and "company" for manufacturer). When tagging a document, the module queries Wikipedia for each proper noun in the key paragraph(s) and, if a page is returned, determines which tag, if any, is most relevant by counting the number of times each tag appears in the first three sentences of the returned page and weighting tags that appear earlier more heavily than those that appear later. After the Wikipedia annotator finishes, the rules-based annotator is run.
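The sketch below approximates this tag-selection logic in Python against the public MediaWiki API. The actual module is a custom annotator inside the Stanford pipeline, and the exact position-weighting scheme used there is not reproduced here; the 1/(i+1) weighting is an illustrative assumption.

import re
import requests

API = "https://en.wikipedia.org/w/api.php"

def intro_sentences(title, n=3):
    """Fetch the first n sentences of a Wikipedia page's introduction."""
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "redirects": 1, "format": "json",
              "titles": title}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    text = next(iter(pages.values())).get("extract", "")
    return re.split(r"(?<=[.!?])\s+", text)[:n]

def best_tag(proper_noun, tag_synonyms):
    """tag_synonyms maps each tag to the words that signal it, e.g.
    {"cpu_manufacturer": ["processor", "company"]} (assumed input)."""
    sentences = intro_sentences(proper_noun)
    scores = {tag: 0.0 for tag in tag_synonyms}
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        for tag, synonyms in tag_synonyms.items():
            # Mentions in earlier sentences are weighted more heavily.
            scores[tag] += sum(lowered.count(w) for w in synonyms) / (i + 1)
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None

For instance, best_tag("Nvidia", {"gpu_manufacturer": ["company", "graphics"], "programming-language": ["language"]}) would typically select gpu_manufacturer, since the opening sentences of that page mention the signal words for that tag.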
Other researchers have also leveraged Wikipedia for various aspects of the NER and ontology population tasks. Many of these efforts are focused on using Wikipedia to do multilingual NER (Nothman et al. 2013; Kim, Toutanova, and Yu 2012). More related to our current work, Kazama and Torisawa determine a candidate tag for an entity by extracting the first noun after a form of the word "be" within the introductory sentence of its Wikipedia page and use that as one of the features in a CRF classifier (Kazama and Torisawa 2007). Kliegr et al. take a similar approach but rather than using a classifier, they leverage WordNet to find synonyms and hypernyms of the type identified from Wikipedia in order to arrive at one of the tags of interest (Kliegr et al. 2008). A more thorough survey of NER approaches that utilize Wikipedia can be found in (Zhang and Ciravegna 2011). The key difference between our work and existing approaches is that our results have been analyzed in a way that enables the ability of Wikipedia to mitigate the atrophy of a rules-based technique to be specifically evaluated.

Our method leads to the results shown in Table 3. While this approach is relatively basic, we see that it improves the performance of the old rule set by 20 percent (0.49 versus 0.41 F-measure). The performance of the current rule set is reduced by 4 percent (0.70 versus 0.73 F-measure), indicating that this approach should not be used unless it is warranted by changes in the domain of interest and the age of the model used for ontology population.

Table 3: Performance of a rules-based approach preceded by Wikipedia-based annotations

Dataset   Model    Precision  Recall  F-measure
Training  Current  0.93       0.93    0.93
Training  Old      0.93       0.93    0.93
Test      Current  0.71       0.70    0.70
Test      Old      0.61       0.41    0.49

Conclusions and Future Work

In this work we consider the problem of populating an ontology from a fast-changing domain: computational environments. We show that the performance of two popular approaches degrades significantly over time and propose the use of a continuously updated background knowledge source to mitigate this performance degradation. A basic implementation of this idea improved the performance of a rules-based approach based on older documents by 20 percent. It remains to be determined if this technique can achieve similar results in other fast-changing fields, such as medicine or law. In addition, it is possible that this result can be improved upon by a more advanced use of Wikipedia, such as through neural network based methods like word2vec (Mikolov et al. 2013). We plan to explore this in our future work on this topic.

All of the materials used in this project, including the article set, answer set, ontology, machine learning models, rules, and code, have been published to GitHub (https://github.com/mcheatham/compEnv-extraction).

Acknowledgments

The first and third authors acknowledge partial support by the National Science Foundation under award PHY-1247316 DASPOS: Data and Software Preservation for Open Science. The third author would like to acknowledge partial support from Notre Dame's Center for Research Computing.

References

Bundschus, M.; Dejori, M.; Stetter, M.; Tresp, V.; and Kriegel, H.-P. 2008. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 9(1):207.

Chang, A. X., and Manning, C. D. 2014. TokensRegex: Defining cascaded regular expressions over tokens. Technical Report CSTR 2014-02, Department of Computer Science, Stanford University.

Cheatham, M.; Vardeman, C., II; Karima, N.; and Hitzler, P. 2017. Computational environment: An ODP to support finding and recreating computational analyses. In 8th Workshop on Ontology Design and Patterns – WOP2017.

Chieu, H. L., and Ng, H. T. 2002. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th International Conference on Computational Linguistics, volume 1, 1–7. Association for Computational Linguistics.

Finkel, J. R.; Grenager, T.; and Manning, C. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 363–370. Association for Computational Linguistics.

Isozaki, H., and Kazawa, H. 2002. Efficient support vector classifiers for named entity recognition.
In Proceedings of the 19th International Conference on Computational Linguistics, volume 1, 1–7. Association for Computational Linguistics.

Kazama, J., and Torisawa, K. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Kim, S.; Toutanova, K.; and Yu, H. 2012. Multilingual named entity recognition using parallel data and metadata from Wikipedia. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, 694–702. Association for Computational Linguistics.

Kliegr, T.; Chandramouli, K.; Nemrava, J.; Svatek, V.; and Izquierdo, E. 2008. Combining image captions and visual analysis for image concept classification. In Proceedings of the 9th International Workshop on Multimedia Data Mining: held in conjunction with the ACM SIGKDD 2008, 8–17. ACM.

Kristjansson, T.; Culotta, A.; Viola, P.; and McCallum, A. 2004. Interactive information extraction with constrained conditional random fields. In AAAI, volume 4, 412–418.

Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S. J.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, 55–60.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Nadeau, D., and Sekine, S. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1):3–26.

Nakashole, N.; Tylenda, T.; and Weikum, G. 2013. Fine-grained semantic typing of emerging entities. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1488–1497.

Nothman, J.; Ringland, N.; Radford, W.; Murphy, T.; and Curran, J. R. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence 194:151–175.

Petasis, G.; Karkaletsis, V.; Paliouras, G.; Krithara, A.; and Zavitsanos, E. 2011. Ontology population and enrichment: State of the art. In Knowledge-driven Multimedia Information Extraction and Ontology Evolution, 134–166. Springer-Verlag.

Zhang, Z., and Ciravegna, F. 2011. Named entity recognition for ontology population using background knowledge from Wikipedia. In Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. IGI Global. 79–104.

Zhou, L.; Cheatham, M.; Krisnadhi, A.; and Hitzler, P. 2018. A complex alignment benchmark: GeoLink dataset. In International Semantic Web Conference, 273–288. Springer.