      Ontology Population for Open-Source Intelligence
                    (Discussion Paper)

     Giulio Ganino, Domenico Lembo, Massimo Mecella, and Federico Scafoglieri

                      DIAG, Sapienza Università di Roma, Italy
          {lembo,ganino,mecella,scafoglieri}@diag.uniroma1.it



         Abstract. We present an approach based on GATE (General Architecture for Text
         Engineering) for the automatic population of ontologies from text documents. We
         describe some experimental results, which are encouraging in terms of the correct
         ontology instances extracted. We then focus on one phase of our pipeline and
         discuss a variant thereof, which aims at reducing the manual effort needed to
         generate pre-defined dictionaries used in document annotation. Our additional
         experiments show promising results also in this case.




1     Introduction

Open-Source INTelligence (OSINT) is intelligence based on publicly available resources,
such as news sites, blogs, forums, etc. OSINT is nowadays used in many application
scenarios, such as security, market intelligence, or statistics. A major issue in
using the Internet as a data source is that Web data come mainly in the form of free text,
thus with no structure and no formal semantics. Two problems therefore
have to be faced: how to derive structured information from unstructured text,
and how to interpret the derived information according to a precise semantics.
    Towards a comprehensive solution to this problem, in this paper we investigate
how to populate a domain ontology with the information extracted from documents
crawled from the Web. Ontologies are indeed nowadays recognized as the best means
to represent domain knowledge at the conceptual level, and thus are particularly suited
to define the concepts and relationships of interest for an application. Thus, structuring
Web data according to the predicates and axioms defined in an ontology turns out to be
particularly effective for our aims, even in the light of the reasoning abilities ontologies
allow for [2]. It is however worth pointing out that we do not consider here the task
of extracting (the intensional/schema level of) ontologies automatically from text
documents. Domain ontologies indeed usually have a very complex intensional structure,
and their design is typically carried out manually, since automatic ontology construction
approaches are in general not able to deal with all the needs of the specific application
domain at hand [7, 1]. In this work we instead pursue an approach for information
extraction from text that assumes that a domain ontology is already available, and that
text should be mined to extract instances of ontology predicates.
    SEBD 2018, June 24-27, 2018, Castellaneta Marina, Italy. Copyright held by the author(s).
    To solve this task, Named Entity Recognition (NER) [3] can be initially adopted, but
this approach is mainly focused on instantiating specific concepts (persons, places, or
organizations), without considering relationships among them. Relational Information
Extraction additionally attempts to identify and instantiate the relationships existing among
concepts. We adopt both approaches, relying on open-source technologies.
    Specifically, we make use of the GATE system1 , given the possibility it offers to customize
its architectural components and to incorporate external components
developed by third parties. Information extraction in GATE is carried out through
different stages, each depending on the contingent needs of the user, who can either
adopt existing dictionaries (a.k.a. Gazetteers) or create different extraction rule sets
through the adoption of Java Annotation Patterns Engine (JAPE) language [4]. GATE
has become popular in recent years, especially in relation to information extraction
from English documents. To some extent, it also supports other languages, primarily
thanks to the dictionaries created and then shared on the platform by its many users.
    We have tested our techniques within the XASMOS and RoMA projects, involving
the Leonardo company and the Sapienza research center on Cyber Intelligence and
Information Security. In these projects, we have considered the “Mafia Capitale” case study,
named after an important 2015 inquiry that received a lot of attention from the
Italian media, and thus turned out to be a valid testbed (in terms of both the number of
available Web documents and the significance of the domain). For this work, we have created
specific dictionaries and JAPE rules for the Italian language to instantiate, with the information
acquired from text documents, an ontology designed for our case study. We present the
case study, in which our approach has been successfully applied to more than 2600 documents
crawled from the Web.
    We have experienced that some of the tasks we conducted, such as dictionary or JAPE
rule definition, are rather domain specific and time-consuming. We investigated how to
refine our approach so that the time needed for these tasks is reduced and the solution
adopted can be more easily reused in different contexts. In particular, after describing our
approach and the use case, we propose a general technique for dictionary construction,
which relies on the extraction of gazetteers lists from the open KB Wikidata2 .
    This paper is an extended abstract of [6].

2     Background
Natural Language Processing (NLP). NLP aims at solving problems related to the
automatic generation and understanding of human language [9]. The process carried out
by NLP is made up of several steps:
    – Tokenization: This task breaks down the raw text into tokens, which can be words,
      spaces, or dots.
    – Part of Speech (POS): This phase aims at labeling each word with a unique tag
      indicating its syntactic role, e.g., Noun, Verb, or Pronoun.
    – Named Entity Recognition (NER): NER classifies atomic elements in a sentence into
      categories (such as “Person” or “Location”) through the application of specific rules
      or statistical machine learning techniques [8].
 1
     https://gate.ac.uk/
 2
     https://www.wikidata.org/
    – Semantic Role Labelling (SRL): SRL assigns a semantic role to each syntactic constituent
      of a sentence, and adds further labels to words in the documents. SRL aims at
      understanding the meaning of an entire sentence starting from the meaning of each
      word taken in isolation and the relationships existing among the words [11].
General Architecture for Text Engineering (GATE). GATE is an architecture, a
framework and a development environment for Language Engineering (LE) [5]. It has a
component-based model, allowing for easy combination of Processing Resources (PRs),
thereby facilitating comparison of alternative configurations of the system or different
implementations of the same module (e.g., different parsers). GATE comprises a core
library and a set of reusable LE modules, which perform basic language processing
tasks, such as POS and semantic tagging. This provides a good starting point for new
applications. The modules used in this work are described in Section 3.
Ontologies. An ontology is a formal description of an abstract, simplified view of a
certain portion of the world [7]. Ontologies can be naturally used to represent knowledge
on the web, where they are mainly adopted to add semantics to data. This also enables
the usage of the reasoning mechanisms ontologies are equipped with [2]. The importance
of ontologies to interpret and structure Web data is also witnessed by the standardization
effort carried out by the W3C, which led to the definition of OWL3 .
    As usual in ontologies, OWL distinguishes between intensional and extensional
knowledge. Intensional knowledge is given in terms of logical axioms involving Classes
(a.k.a. concepts) and properties, which are of two types, ObjectProperties (a.k.a. binary
relationships or roles) and DataProperties (a.k.a. attributes). Classes denote sets of ob-
jects, ObjectProperties denote binary relations between objects, whereas DataProperties
denote binary relations between objects and values from predefined datatypes. Person,
livesIn, and personAge are examples of Class, ObjectProperty, and DataProperty, respectively.
At the extensional level, an OWL ontology is a set of assertions about its
instances. For example, ClassAssertion(Person John) indicates that the individual John
is an instance of Person, whereas ObjectPropertyAssertion(livesIn John NY) specifies
that the pair of individuals (John, NY) is an instance of livesIn.
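    To give a concrete flavour of this notation, the assertions above can for instance be
rendered in the equivalent Turtle serialization sketched below; the namespace and the age
value are hypothetical and are used here only for illustration.

      @prefix :    <http://example.org/osint#> .
      @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

      :John a :Person ;                      # ClassAssertion(Person John)
            :livesIn   :NY ;                 # ObjectPropertyAssertion(livesIn John NY)
            :personAge "42"^^xsd:integer .   # DataPropertyAssertion(personAge John 42)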

3     Approach
Our approach comprises two phases: Semantic Annotation and Ontology Population.
Semantic annotation. In this stage we create annotations, i.e., metadata that indicate
properties of the text contained in the analyzed documents. At the end of this phase,
the annotations will allow us to identify those entities that are indeed instances of the
classes and properties of the ontology given as input to our pipeline. In our GATE-based
approach, this phase relies on several PRs, which are available as GATE plug-ins,
possibly provided by third-party organizations4 . Such resources are described below.
 1. GATE Unicode Tokeniser: this is a PR of the GATE core library used to split the
    text into Tokens and SpaceTokens, where the latter denote spaces among single terms,
    whereas the former are numbers, punctuation marks, symbols, and words.
 3
     Ontology Web Language (https://www.w3.org/TR/owl2-primer/)
 4
     https://gate.ac.uk/gate/doc/plugins.html
 2. RegEx Sentence Splitter: it is another PR of the GATE core library. It is used to divide
    documents into sentences. It is essentially language-independent, and thus can be
    used as is for many languages, including Italian. It is implemented in Java, and
    is based on regular expressions that define syntactic rules for sentence identification.
    At the end of this phase two new annotations are added to the document, i.e.,
    Sentences and Splits, the latter indicating separation between sentences.
 3. TreeTagger POS5 : it is a component for document annotation with POS and lemma
    information, developed at the University of Stuttgart. It is a Markov Model tagger which
    makes use of a decision tree.
 4. Gazetteer [4]: this GATE built-in component annotates the documents on the basis
    of a set of lists of words, such as names of cities or organisations, or abbreviations for
    types of companies (e.g., ltd., Corp., Inc.). Each list can be associated with so-called
    major and minor types. These types correspond to categories, such that minor types
    are more specific than the corresponding major types. If the document contains a
    string matching an element of a Gazetteer list, the component annotates the
    string with the major and minor type of this list. If the string matches more than
    one list, the major and minor types of all the matching lists are added.
 5. JAPE Transducer: this component is used to import user-written JAPE rules into the
    GATE platform. A JAPE program consists of a set of pattern/action rules, such
    that the left-hand side of a rule consists of an annotation pattern description, and the
    right-hand side consists of an annotation manipulation statement. The JAPE language
    allows one to recognize regular expressions over the annotations produced by the
    PRs that run before the JAPE Transducer. Once the expression is found, a further
    annotation referring to the searched patterns/entities is added to the document [4]
    (a minimal example rule is sketched right after this list).
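    To give an idea of the flavour of such rules, the following is a minimal sketch of a JAPE
rule in the spirit of those we wrote; the phase name, the Gazetteer major type, and the
output annotation are illustrative and do not necessarily coincide with those actually used
in our pipeline.

      Phase: PersonAnnotation
      Input: Lookup Token
      Options: control = appelt

      // Annotate a Gazetteer first-name match followed by a word token
      // (assumed to be the surname) with a Person annotation.
      Rule: FirstNameThenSurname
      (
        {Lookup.majorType == "person_first_name"}
        {Token.kind == "word"}
      ):pers
      -->
      :pers.Person = {rule = "FirstNameThenSurname"}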

Ontology population. It is the process of inserting instances into a domain ontology. To
this aim we used three PRs, two of which are JAPE-based modules that essentially convert
annotations created in the semantic annotation phase so that they are compatible with
the format used by the third PR, called OwlExporter6 , which is the PR that concretely
extracts ontology instances in the form of OWL assertions from the annotated text [6].
The OwlExporter manages two ontologies: a domain specific ontology, like the one we
developed for our use case, and a domain-independent NLP ontology, which contains
concepts commonly used in LE, like paragraphs, sentences, or noun phrases [10].


4    Case Study
We used Web Content Extractor7 to retrieve articles concerning the “Mafia Capitale”
inquiry that appeared on the Web portals of some major Italian newspapers between June
16, 2015, and February 29, 2016. The crawling phase generated 2657 articles.
    We have then defined a domain ontology to represent some aspects of interest for the
domain at hand. Our ontology is defined over an alphabet of 21 Classes, 9 ObjectProperties
and 14 DataProperties. We have then manually created specific Gazetteer lists and JAPE
 5
   http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
 6
   http://www.semanticsoftware.info/owlexporter
 7
   http://www.newprosoft.com/web-content-extractor.htm
                                 Table 1. Results of test #1

 Words     Annotations    CA    OPA     Correct    Missing     Precision   Recall
 6,084        414         404    10      403         98          97%        78%



rules for document annotation through the GATE PRs described in Section 3. We remark
that both the Gazetteer lists and the JAPE rules are specific to the Italian language.
     To evaluate our approach, we carried out two tests, both performed on a MacBook
Pro 9.2, with an Intel Core i7 at 2.90 GHz and 8 GB of RAM. In the first test, we have
randomly chosen 10 documents among those produced by the crawling phase, and have
asked a domain expert to annotate them, identifying all and only the instances of the
ontology predicates that can be extracted from these documents. We have then compared
the annotations produced by our pipeline with the annotations of the domain expert,
by computing the Precision (P ) and the Recall (R). Recall that P = C/(C + S),
where C + S is the number of annotations produced through our pipeline, some of
which are indeed correct (C), if compared with the annotations identified by the expert,
whereas others are spurious (S) (i.e., false positives). Moreover, R = C/(C + M ), where
M denotes the annotations missed by our pipeline, with respect to those identified by
the expert (i.e., false negatives). The results of this first test are given in Table 1, where
Words indicates the number of words contained in the analyzed documents, Annotations
is the total number of annotations produced by our pipeline (i.e., C + S), and CA and OPA
denote the number of Class and ObjectProperty assertions extracted, respectively (to
keep the expert annotation task simple, we did not consider DataProperty annotations).
     We got excellent Precision and good Recall, a significant number of class assertions,
but few ObjectProperty assertions. As for the Precision, the errors we obtained were only
due to the wrong annotations associated with the word “Marino”, which is both a city and
the last name of a former mayor of Rome, since our JAPE rules were not able to disambiguate
this situation. We notice, however, that this kind of problem can be solved by adding
other PRs to our pipeline, like the Pronominal Coreference (PC)8 , which provides an
annotation on pronouns that refer to already annotated entities. This PR is currently
available in GATE only for the English language, and some extension is needed to make
it work also for Italian.
     The value we obtained for the Recall is mainly affected by the fact that we were not
able to identify many ObjectProperty assertions. While there is much room for
improvement in this respect, e.g., by adding further JAPE rules, we believe that the usage of
the PC mentioned above can also be helpful to increase the Recall, since many links
between entities can be discovered by correctly annotating pronouns.
     In the second test we have processed through our pipeline all the 2657 articles produced
by the crawling phase. The results are shown in Table 2, where also the overall running
time needed to process all documents is given. In the table, DPA denotes the number of
DataProperty assertions we were able to extract. Due to the large number of documents
used in this test, we could not measure overall precision and recall, but we mainly used
this test to evaluate the impact of our approach on a large text corpus. Our results are
 8
      https://gate.ac.uk/sale/tao/splitch6.html#sec:annie:pronom-coref
                                   Table 2. Results of test #2

                       Words      CA      OPA     DPA     Exec. time (sec.)
                     1,285,290  38,452    419    99,145   1,948.88 (∼30 mins)




encouraging, in particular for the number of CA and DPA extracted. Furthermore, to get
an idea of the quality of the result, we measured the precision on 1% of the output
data (involving a domain expert) and obtained a value of 94%.


5     Simplifying Gazetteer Lists Generation
As said in Section 3, the Gazetteer PR annotates the documents on the basis of a set of
lists of words, used to find occurrences of these words in the text (e.g., for the NER
task). In our use case, it was crucial to create all the Gazetteer lists we used, because
all the articles we analyzed are in Italian, and when we started our project no reusable
support for Gazetteer lists for this language was available. To create the lists related to
the domain entities of “Mafia Capitale”, we used two methods:

    – based on Open Data Sources (ODS): Lists have been downloaded (and then manually
      refined) from open data Web sites. In particular, we did this for the first and last names
      of individuals belonging to a specific category, e.g., the members of the Government
      of the city of Rome. To this aim we have downloaded specific lists of interest from,
      e.g., OpenPolitici9 , for the lists of Politicians currently in office, and also from
      other data sources, such as DBpedia.
    – based on Human Knowledge (HK): When we could not find a specific list in an
      open data source, we resorted to a completely manual construction of the dictionary,
      by exploiting personal knowledge on the topic and Web searches. In particular, this
      has been done to construct lists of words that identify certain categories (e.g.,
      containing all phrases used to mean “Lawyer”).

    The construction of such lists through this approach was non-trivial and required a
significant amount of time. We thus investigated alternative methods for Gazetteer lists
generation that could reduce the manual effort required, and could be easily replicated
in different domains, thus increasing the generality of our approach. To this aim, we
focused on Wikidata as an informative source for Gazetteer lists production.
    Wikidata is a collaboratively edited knowledge base, which organizes a large amount
of data in a structured way according to a general reference ontology. It is an openly
accessible resource, following the Semantic Web standards for exporting, interconnecting
and querying data, which can be edited and read by both machines and humans. We
have downloaded a Wikidata RDF dump (specifically we used the version of March 3,
2017), and have loaded it on a graph database management system with RDF/SPARQL
support, namely Blazegraph10 . This has enabled us to access Wikidata information
 9
     http://politici.openpolis.it/
10
     https://www.blazegraph.com/
  Table 3. Annotations through Gazetteer lists based on ODS/HK vs. those based on Wikidata


List                                  Entries Annotations Spurious Ann. Missing Ann. Precision Recall
First Names ODS                          8913         322            83            0       74%   100%
First Names Wikidata                     2517         262            31            5       88%    98%
Criminal Organizations HK                  25          37            21            0       43%   100%
Criminal Organizations Wikidata            68          20            18           14       10%    12%
Criminal Organizations Wikidata Mod.      273          30            24           10       20%    37%
Politicians ODS                          1083          11             0           40      100%    22%
Politicians Wikidata                     7778          17             0           34      100%    33%
Journalists ODS                           369           4             0            4      100%    50%
Journalists Wikidata                     3727           8             0            0      100%   100%
Political Parties ODS                     131          23             0            2      100%    92%
Political Parties Wikidata                447           4             0           21      100%    16%
Political Parties Wikidata Mod.          1293          26             7            6       73%    76%


through standard SPARQL queries. By virtue of this approach it is possible to avoid the
tedious and time-consuming development of ad hoc solutions for each desired Gazetteer
list. Indeed, once the setup of the system is completed, a user just needs to define and
execute a set of SPARQL queries to obtain the lists of interest. Such queries are specified
taking into account the lists to populate, the Wikidata ontology structure, and its data
model. The result of each SPARQL query is a CSV file, which simply has to be renamed
to a .lst file to be processed by the Gazetteer PR.
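    As an example, a Gazetteer list of Italian politicians can be obtained, using the standard
Wikidata vocabulary, with a query along the lines of the following sketch (the actual
queries used in the project may differ in the items and properties selected):

      PREFIX wd:   <http://www.wikidata.org/entity/>
      PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

      # Italian labels of items whose occupation (P106) is politician (Q82955)
      # and whose country of citizenship (P27) is Italy (Q38)
      SELECT DISTINCT ?name WHERE {
        ?person wdt:P106 wd:Q82955 ;
                wdt:P27  wd:Q38 ;
                rdfs:label ?name .
        FILTER(LANG(?name) = "it")
      }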
     To compare the two approaches, we considered five lists containing Italian Politicians,
Journalists, First Names, Criminal Organizations, and Political Parties, respectively. Each
list has been produced and used in two versions, i.e., ODS/HK based and Wikidata based.
The comparison has been done on 15 articles and followed the same criteria of the first
test described in Section 4: a manual annotation aimed at identifying entities belonging to
the five chosen lists has been performed by a domain expert; then, the articles have been annotated
through the Gazetteer PR by using the two different sets of lists generated by the two
approaches. The results are shown in Table 3.
     From an analysis of the lists created from Wikidata we noticed that they contain various
entries with typos and flaws, which was expected to some extent. Our experiments
however show that the Precision value for annotations with the Wikidata lists is the same
as or better than the Precision with the ODS/HK lists in all cases but the Criminal
Organizations one. For several Wikidata lists, also the Recall is close to the one measured
for the corresponding ODS/HK list. For the cases of Criminal Organizations and Political
Parties we instead initially got very low values for the Recall. This was mainly due
to problems with uppercase/lowercase letters (since Gazetteer matching in GATE is case
sensitive). We thus refined the two lists by adding, for each term, three versions: all
uppercase letters, all lowercase letters, and capitalized words (a sketch of this expansion
is given below). In Table 3 these new lists are denoted as Wikidata Mod. We can observe
that with this fix the value of the Recall becomes greater than that of the original Wikidata
lists, and reaches acceptable levels. However, for the Political Parties Wikidata Mod. list
this approach has generated a decrease in Precision. This has been caused by the introduction
of new entries, e.g., the word si (which is the lowercase version of the acronym SI, an
Italian political party). This word is indeed used in Italian as a reflexive pronoun, thus
leading to some wrong annotations. Despite the decrease in Precision, we notice that the
modification of the Political Parties list increases the Recall by 60 percentage points.
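    For completeness, the case-variant expansion just described can be realized with a few
lines of code; the following Java sketch (file names are purely illustrative) reads an original
Wikidata list and writes the corresponding Mod. version.

      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.Arrays;
      import java.util.LinkedHashSet;
      import java.util.List;
      import java.util.Set;
      import java.util.stream.Collectors;

      // Sketch: expand each Gazetteer entry into its three case variants
      // (all uppercase, all lowercase, capitalized words), since Gazetteer
      // matching is case sensitive.
      public class GazetteerCaseVariants {

          static String capitalizeWords(String s) {
              return Arrays.stream(s.split(" "))
                      .map(w -> w.isEmpty() ? w
                              : Character.toUpperCase(w.charAt(0)) + w.substring(1).toLowerCase())
                      .collect(Collectors.joining(" "));
          }

          public static void main(String[] args) throws Exception {
              List<String> entries = Files.readAllLines(Paths.get("political_parties_wikidata.lst"));
              Set<String> expanded = new LinkedHashSet<>();
              for (String entry : entries) {
                  String e = entry.trim();
                  if (e.isEmpty()) continue;
                  expanded.add(e.toUpperCase());
                  expanded.add(e.toLowerCase());
                  expanded.add(capitalizeWords(e));
              }
              Files.write(Paths.get("political_parties_wikidata_mod.lst"), expanded);
          }
      }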
     From the data in Table 3, we can conclude that our experiments have been
quite successful, since the Wikidata-based approach achieves adequate values in terms of
number of annotations, Precision, and Recall. We expect these values to remain essentially
unchanged as the number of analysed articles increases. We believe that this confirms that
resorting to Wikidata for producing lists to be used by the Gazetteer PR allows for both
time savings and good results.

6    Conclusions
Future improvements of our approach include the insertion into our pipeline of the PC
component discussed in Section 4. Also, we are working on further refinements of our
solution in order to simplify some of the phases that currently require manual intervention,
in the spirit of the work done on Gazetteer lists described in Section 5. In particular, we
are currently investigating a way to streamline the definition of JAPE rules.
Acknowledgments. This work has been partly supported by Leonardo Company in
the context of the XASMOS initiative, and by the Italian project RoMA (SCN 00064).
Giulio Ganino has been supported by the FILAS grant Laboratori teorico-sperimentali
a supporto delle applicazioni spaziali delle industrie laziali (FILAS-RU-2014-1058).

References
 1. N. Antonioli, F. Castanò, S. Coletta, S. Grossi, D. Lembo, M. Lenzerini, A. Poggi, E. Virardi,
    and P. Castracane. Ontology-based data management for the Italian public debt. In Proc. of
    FOIS ’14, pages 372–385, 2014.
 2. F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The De-
    scription Logic Handbook: Theory, Implementation and Applications. Cambridge University
    Press, 2nd edition, 2007.
 3. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural
    language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
 4. H. Cunningham. Developing Language Processing Components with GATE Version 8.
    University of Sheffield Department of Computer Science, 2014.
 5. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and
    Graphical Development Environment for Robust NLP Tools and Applications. In Proc. of
    ACL’02, 2002.
 6. G. Ganino, D. Lembo, M. Mecella, and F. Scafoglieri. Ontology population for open-source
    intelligence: a GATE-based solution, 2018. Submitted to an Int. Journal.
 7. N. Guarino. Formal ontology in information systems. In Proc. of FOIS ’98, Frontiers in
    Artificial Intelligence, pages 3–15. IOS Press, 1998.
 8. M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld. Mathematical Foundations of
    Speech and Language Processing. Springer New York, 2004.
 9. R. Navigli. Word sense disambiguation: a survey. ACM COMPUTING SURVEYS, 41(2):1–69,
    2009.
10. R. Witte, N. Khamis, and J. Rilling. Flexible ontology population from text: The OwlExporter.
    In Proc. of LREC’10, May 2010.
11. H. Zhao, X. Zhang, and C. Kit. Integrative semantic dependency parsing via efficient large-
    scale feature selection. J. of Artificial Intelligence Research, 46:203–233, 2013.