Proceedings of the KI 2017 Workshop on Formal and Cognitive Reasoning

Unsupervised Text Annotation

Tanya Braun, Felix Kuhr, and Ralf Möller
Institute of Information Systems, Universität zu Lübeck, Lübeck
{braun,kuhr,moeller}@ifis.uni-luebeck.de

Abstract. We introduce the unsupervised text annotation model UTA, which iteratively populates a document-specific database containing a symbolic content description of the document. The model identifies the most closely related documents using both the text of the documents and their symbolic content descriptions. UTA extends the database of one document with data from related documents without sacrificing precision.

Keywords: Unsupervised Text Annotation

1 Introduction

In recent years, knowledge graph systems have emerged that use methods of information extraction (IE) and statistical relational learning (SRL) to extract data from documents and build knowledge graphs, representing a symbolic content description as a graph of entities and relations. Some of the best-known knowledge graph systems are DeepDive [2], the Never Ending Language Learner (NELL) [5], YAGO [6], and KnowledgeVault [3]. These systems generate knowledge graphs from documents and external sources available on the web without a specific user in mind, often producing graphs with billions of edges linking millions of nodes. The systems have some form of supervision, either through interaction or through prior information. The sheer number of entities and relations leads to huge graphs, and the approach of these knowledge graph systems is usually not tailored towards special tasks other than the accumulation of data. Additionally, most of the systems only make use of the entities and relations but ignore the original text document after extracting entities and relations from it.

Given the limitation that most knowledge graph systems produce huge graphs, we present the Unsupervised Text Annotation (UTA) approach, focusing on a document-specific data representation that links each document to a graph database (DB) whose graph represents the entities and relations extractable from the linked text document. An iterative process extends the graph with entities and relations that are not directly extractable from the document itself but are extractable from related documents. An important aspect of the approach is that it does not discard the text of a document after extracting entities and relations, but uses the text as a supporting mechanism to decide whether entities and relations are relevant for a document-specific content description.

UTA employs methods from IE and SRL to address the following problems: (i) Adding entities and relations to document-specific DBs by hand is time-consuming. (ii) Association is complex, as it has to add relevant entities and relations without adding irrelevant data, to achieve high precision and recall. Precision and recall compare the document-specific DBs against DBs built by human text annotators. To handle the first problem, we use IE techniques to extract entities and relations from documents. For the second problem, we work on a small subset of documents to enrich a document-specific DB, using topic modeling techniques to identify related documents. Instead of simply adding entities and relations to the DB, we perform an analysis of the corresponding text segments: two documents sharing the same entities might contain content about two completely different subjects.
Only if the two text segments are similar do the entities and relations of related documents extend the DB. Thus, UTA contains an algorithm that first identifies, for each document, other related documents and second extracts entities and their relations from those documents to extend the document-specific DB with relevant data. To generate a document-specific DB, we need to (i) extract entities and corresponding relations from documents, (ii) localize entities and relations within the text document, and (iii) enrich a document-specific DB with data from related documents. We refer to data from related documents as associative data.

The remainder of this paper is structured as follows. We first introduce background information on topic modeling techniques and information extraction. We then present the algorithm to generate and extend document-specific graphs. Next, we provide a case study to show the potential of the unsupervised text annotation approach. Last, we present a conclusion.

2 Preliminaries

This section presents the latent Dirichlet allocation (LDA) topic model and brief information about IE.

Topic Modeling Techniques. Topic modeling techniques learn topics from a collection of documents and calculate for each document a topic probability distribution (aka topic proportions). Topics represent co-occurring words of the documents but have no high-level description, like sports or film industry. LDA [7] is a well-known generic topic model providing good results while being relatively simple. It uses a bag-of-words approach that simplifies documents such that topics depend only on a document's word frequency distribution. For a document d, LDA learns a discrete probability distribution θ(d) that contains for each topic k ∈ {1, . . . , K} a value between 0 and 1; the values sum to 1 over all K topics. To find similar documents, we use the Hellinger distance [9], which measures the distance between two probability distributions. Given two topic proportions θdi and θdj for documents di and dj, the Hellinger distance is given by

H(θdi, θdj) = (1/√2) · √( Σ_{k=1}^{K} ( √θdi,k − √θdj,k )² ),

where k refers to the topics in the documents.
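To make the distance concrete, the following minimal Python sketch computes the Hellinger distance between two topic proportions (the function name and the example distributions are illustrative only; they are not part of UTA's implementation, which is written in Java):

    import math

    def hellinger(theta_i, theta_j):
        # Hellinger distance between two discrete distributions of
        # equal length K; the 1/sqrt(2) factor bounds it to [0, 1].
        s = sum((math.sqrt(p) - math.sqrt(q)) ** 2
                for p, q in zip(theta_i, theta_j))
        return math.sqrt(s / 2)

    # Two hypothetical topic proportions over K = 3 topics.
    print(hellinger([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # ~0.08 -> similar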
Information Extraction. IE, a subdomain of natural language processing, refers to methods that extract entities and their relations from text. Two main tasks of an IE system are named-entity recognition and relation extraction. A possible output of an IE system is a set of Resource Description Framework (RDF) triples containing the extracted entities and relations for a document. Identifying entities and relations within arbitrarily long sentences containing subordinate clauses and other grammatical structures makes the task difficult. An example IE system is Stanford's OpenIE library, based on [1]. That IE system consists of two stages: (i) learn a classifier to split sentences of text documents into shorter utterances, and (ii) apply natural logic to further shorten the utterances such that the shortened utterances can be mapped to OpenIE triples representing subject, predicate, and object.

3 Unsupervised Text Annotation

This section presents a formal description of the unsupervised text annotation approach, which extracts entities and relations directly from a text document and enriches the DB with data from related documents. We use the term local DB to refer to a document-specific graph representing a content description. We distinguish between (i) extractable data, which is directly extractable from the text of a document, (ii) associative data, which is part of another document's local DB, and (iii) inferred data, which is not part of any local DB and is therefore new at a global level. The overall objective of UTA is to iteratively enrich local DBs with data from other documents of the same document corpus.

Definition 1 (Document corpus). Let {di}_{i=1}^{n} be a set of n documents. Then D is the document corpus representing the set {di}_{i=1}^{n}. Each document di in D links to a local DB gi, where gi is a graph representing entities and the relations between them.

Definition 2 (Graph). A graph is a pair g = (V, E) consisting of sets V and E, where the elements of V are vertices, representing entities, and the elements of E are edges, representing relations between entities. An edge e = (u, v, r, l) represents a relation r between vertices u and v and contains a localization l. The localization l refers to the text segment of the corresponding document containing the entities and the relation.

Enriching the DB gi of document di is the process of extending gi with data from related documents. The global DB G of D is given by the union of all local DBs, G = ∪_{di ∈ D} gi. For a document di ∈ D with gi = ∅, adding data to gi requires the following steps: (i) remove di from D, (ii) obtain entities and their relations from di using IE techniques, (iii) localize the entities and relations at the corresponding positions in the text of di, (iv) identify the documents Ddi related to di, (v) enrich gi with data from Gdi, and (vi) return di to D. Corpus D, and thereby global DB G, is bound to become too large at some point. Thus, for enriching gi, only a subset Ddi ⊆ D of documents related to di is selected.

Algorithm 1 Iterative Database Construction
  while true do
    remove(D, di)
    if gi = ∅ then
      gi ← IE(di)
    end if
    Ddi ← select(D, di)
    Gdi ← ∪_{j ∈ Ddi} gj
    gi ← enrich(Gdi, gi)
    add(D, di)
  end while

Algorithm 1 shows a pseudo-code description of the iterative DB construction, with procedure select(D, di) for selecting Ddi and procedure enrich(Gdi, gi) for enriching the DB gi of document di with data from Gdi. We reach a fixed point where, unless we add new data through a new document, no changes in the local DBs occur.

The following sections detail the identification of related documents and the enrichment of documents' DBs.

3.1 Document Selection

The selection of documents consists of two parts: (i) identifying similar documents for document d using matching techniques and (ii) building a set Dd from the matching documents.

Document Matching. Enriching di's DB gi with data from other DBs gj requires the identification of documents whose content matches the content of the current document di with respect to the following aspects: (i) similarity in topic proportions, and (ii) subgraph isomorphism between graphs gi and gj. Each matching technique produces a set of documents, Dd,top and Dd,iso, respectively, and Dd is the union of Dd,top and Dd,iso. For both techniques, we specify assumptions and selection criteria.

The first technique, similarity in topic proportions, uses the text of the documents. We assume that two documents sharing similar topic proportions contain content on related topics; thus, one's DB might enrich the other's DB. Document dj is part of Dd,top if H(θd, θdj) < t1.
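Continuing the earlier sketch, this selection criterion could look as follows over precomputed topic proportions (a hypothetical helper; the default threshold mirrors the t1 = 0.8 used later in the case study):

    def select_by_topics(theta_d, all_thetas, t1=0.8):
        # Indices of documents whose topic proportions lie within
        # Hellinger distance t1 of document d's proportions.
        return [j for j, theta_j in enumerate(all_thetas)
                if hellinger(theta_d, theta_j) < t1]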
Adding new documents to the corpus D changes the topics. Hence, an update of the topics and topic proportions is required after the document corpus is extended with new documents.

The second matching technique performs the comparison on the level of local DBs. We assume that related documents share entities and relations, i.e., have the same vertices or edges, and thus contain information of interest for each other. We adopt a heuristic based on the Jaccard index to decide whether a document dj is relevant for another document di. The Jaccard index for two local DBs gi and gj is given by

J(gi, gj) = |gi ∩ gj| / |gi ∪ gj|.

We use J with DBs as inputs, meaning we compare DB elements for equality. For J, we exclude the text segment localization l from all edges and consider partial overlaps of vertices and edges within the graphs representing the DBs. Given two edges e′ := (u′, v′, r′, l′) and e′′ := (u′′, v′′, r′′, l′′), we say e′ and e′′ have a full overlap (and are equal) iff u′ = u′′, v′ = v′′, and r′ = r′′. A partial overlap occurs if a non-empty subset of the vertices (entities) and edges (relations) overlaps. We incorporate the different overlaps using weights: a full overlap receives the weight w3 = 1.0, a two-entity overlap the weight w2 = 0.75, a two-component overlap consisting of one entity and one relation the weight w1 = 0.5, and a one-component overlap the weight w0 = 0.25. Given two DBs gi and gj, let n3, n2, n1, and n0 denote the number of elements with a full, a two-entity, a two-component, and a one-component overlap, respectively. We add document dj to Ddi,iso if

J(gi, gj) = n / (|gi| + |gj| − n) > t2,   where n = n3 · w3 + n2 · w2 + n1 · w1 + n0 · w0.

For gj to have data to add to gi, gj ⊃ gi has to hold. Using the Jaccard index allows identifying the set of documents Dd,iso whose DBs can enrich the DB of document d.

3.2 Database Enrichment

The previous section has shown how to select the documents Dd that potentially enrich the DB g of d. This subsection explains how to enrich g using the DBs Gd belonging to the matching documents Dd, focusing on both high precision and high recall when comparing the document-specific DBs against DBs built by human text annotators.

Potential associative data for the DB g of document d is the data in Gd that is not already in g. For adding associative data, we are therefore interested in Gd \ g. Unfortunately, Gd \ g may be very large if Dd is large. Simply adding all data from Gd that is not in g extends g to a large DB containing data not closely related to d; consequently, the precision would be very low. Circumventing additions of irrelevant data to g requires filtering techniques. The following steps filter the data in Gd: (i) Build a set G′d of candidate data. Given Gd, candidate data edj of document dj is a relation r between two entities u and v, where relation r appears in at least n DBs of Gd or the text segment referenced by the localization l of edj has a high similarity to the text segments referenced by l′ of edi. (ii) Rank the elements of G′d according to their relevance factor for d and add the elements with a high relevance factor to g. The relevance factor is defined by

RF(edj) = (1 − H(θdi, θdj)) · J(gi, gj) · f(t(Gd)),

where H(θdi, θdj) is the Hellinger distance, J(gi, gj) the Jaccard index, and f(t(Gd)) the frequency of the relation among the relations in the domain of the two related documents di and dj.
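The paper leaves open how partially overlapping edges are matched and how f(t(Gd)) is computed. As one possible reading, the following sketch classifies pairwise edge overlaps with the weights from Section 3.1, derives the weighted Jaccard index from a greedy best-match per edge, and combines the pieces into a relevance factor (all function names are our own):

    def overlap_weight(e1, e2):
        # e = (u, v, r): an edge without its localization l.
        same = [e1[0] == e2[0], e1[1] == e2[1], e1[2] == e2[2]]
        if all(same):
            return 1.0                    # full overlap, w3
        if same[0] and same[1]:
            return 0.75                   # two-entity overlap, w2
        if sum(same) == 2:
            return 0.5                    # entity + relation, w1
        if sum(same) == 1:
            return 0.25                   # one-component overlap, w0
        return 0.0

    def weighted_jaccard(g_i, g_j):
        # Greedy reading: score each edge of g_i by its best match in g_j.
        if not g_i and not g_j:
            return 0.0
        n = sum(max((overlap_weight(e1, e2) for e2 in g_j), default=0.0)
                for e1 in g_i)
        return n / (len(g_i) + len(g_j) - n)

    def relevance_factor(h_dist, g_i, g_j, rel_freq):
        # RF = (1 - Hellinger distance) * Jaccard index * relation
        # frequency; rel_freq stands in for f(t(Gd)), which the
        # paper leaves abstract.
        return (1.0 - h_dist) * weighted_jaccard(g_i, g_j) * rel_freq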
4 Case Study

This section presents a short case study of UTA, linking each document to a document-specific DB and filling the DB with entities and relations directly extractable from the document as well as with associative data from related documents. The case study portrays an iterative DB construction. The goal is to identify entities and relations for a Wikipedia article without extracting entities and relations directly from the article itself.

uta is a Java program implementing the UTA approach, using the library MALLET [8] for topic modeling with the following parameters: (i) α = 0.01, (ii) β = 0.01, (iii) number of topics K = 100, and (iv) number of iterations for the model in MALLET = 1000. uta selects documents from D and adds them to Dd if their similarity in topic proportions to document d satisfies H(θd, θdj) < t1, where t1 is 0.8. The threshold t2 for the Jaccard index J(gi, gj) is set to 0.2.

4.1 Document Corpus Preparation

uta uses data from DBpedia [4], which stores structured information as RDF triples and links each Wikipedia article to a set of RDF triples. The triples from DBpedia serve as the ground truth. The test corpus D contains 100 Wikipedia articles, where each article links to a corresponding, initially empty DB. The articles are about the automotive domain, with a connection to the German car brand BMW. uta extracts the text of the Wikipedia articles and uses a parser to exclude HTML tags. Then, uta creates for each article an object containing the text of the article. Having the corpus D containing all documents, we use MALLET to pre-process the text documents by (i) lowercasing all characters, (ii) tokenizing the result, and (iii) eliminating tokens that are part of a stop-word list containing 524 words. After pre-processing, uta randomly selects a document d from the corpus and generates a topic model for the remaining documents D \ {d}. uta estimates the topic proportions for each document using MALLET. Afterwards, uta links the text with the topic proportions.

4.2 Iterative DB Construction

uta performs the following steps to iteratively fill the document-specific DB g of document d: (i) estimate d's topic proportions θd, (ii) identify documents similar to d using the Hellinger distance and add them to Dd, and (iii) identify similar documents by searching for isomorphic subgraphs of g in G \ g. After applying IE techniques to extract the extractable data, uta extends the DB of each document in the iterative fashion by adding new data from other DBs. Associative data might be a suitable symbolic content description, but we evaluate uta's performance against the annotations of a human annotator, who often adds different data to a document's DB.

[Fig. 1: Iterative KB construction for all 100 documents of the corpus D, over 15 iterations. The left box plot presents the precision of uta; the right box plot presents the recall.]

We perform the iterative KB construction for all 100 documents and present precision and recall as box plots in Figure 1. For most documents, the IE process was unable to extract all entities and relations; for some documents, the IE process was unable to extract any entity at all. Hence, the lower bound in the box plots of Figure 1 is 0. For some documents, we reach a precision of 0.6 and a recall of 0.29. On average, the precision of uta is 0.24 and the recall 0.11. uta reaches a fixed point at the latest after 15 iterations.
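For the evaluation, precision and recall reduce to set comparisons over triples. A minimal sketch, assuming (as we read the setup) that both a constructed DB and the DBpedia ground truth are available as sets of (subject, predicate, object) triples compared for exact equality:

    def precision_recall(db_triples, ground_truth):
        # db_triples, ground_truth: sets of (subject, predicate, object).
        tp = len(db_triples & ground_truth)
        precision = tp / len(db_triples) if db_triples else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        return precision, recall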
5 Conclusion

We present an unsupervised text annotation approach that links each document to a graph database containing entities and relations representing a content description. Each database represents data that is directly extractable from the document itself and data that is iteratively obtained from related documents. The performance of UTA depends on the number of topics, the two thresholds for the Hellinger distance and the Jaccard index, and the quality of the information extraction techniques. UTA shows potential in being able to generate a symbolic content description for documents in an unsupervised fashion, but requires improvements to achieve higher precision and recall.

UTA can be used by knowledge graph systems to link documents with a corresponding symbolic content description containing extractable and associative data. We are interested in extending UTA to an approach that takes an arbitrary query as input and responds with a set of documents answering the query. Such an approach requires the linking of documents and the corresponding symbolic document descriptions.

References

[1] Gabor Angeli, Melvin Johnson Premkumar, and Christopher D. Manning. "Leveraging Linguistic Structure for Open Domain Information Extraction". In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015). 2015.
[2] Ce Zhang. "DeepDive: A Data Management System for Automatic Knowledge Base Construction". PhD thesis. The University of Wisconsin-Madison, 2015.
[3] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. "From Data Fusion to Knowledge Fusion". In: Proceedings of the VLDB Endowment 7.10 (2014), pp. 881–892.
[4] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Chris Bizer. "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia". In: Semantic Web Journal (2014).
[5] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. "Toward an Architecture for Never-Ending Language Learning". In: AAAI. 2010.
[6] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. "Yago: A Core of Semantic Knowledge". In: Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 697–706.
[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation". In: Journal of Machine Learning Research 3 (2003), pp. 993–1022.
[8] Andrew Kachites McCallum. "MALLET: A Machine Learning for Language Toolkit". http://mallet.cs.umass.edu. 2002.
[9] Ernst Hellinger. "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen". In: Journal für die reine und angewandte Mathematik 136 (1909), pp. 210–271.