=Paper=
{{Paper
|id=None
|storemode=property
|title=Approximate subgraph matching for detection of topic variations
|pdfUrl=https://ceur-ws.org/Vol-762/paper5.pdf
|volume=Vol-762
}}
==Approximate subgraph matching for detection of topic variations==
Mitja Trampuš (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia, mitja.trampus@ijs.si)
Dunja Mladenić (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)
ABSTRACT

The paper presents an approach to detection of topic variations based on approximate graph matching. Text items are represented as semantic graphs and approximately matched based on a taxonomy of node and edge labels. Best-matching subgraphs are used as a template against which to align and compare the articles. The proposed approach is applied to news stories using WordNet as the predefined taxonomy. Illustrative experiments on real-world data show that the approach is promising.

Categories and Subject Descriptors

I.2.8 [Artificial intelligence]: Search—Graph and tree search strategies; I.5.4 [Computing methodologies]: Applications—Text processing

1. INTRODUCTION

One of the classic goals of text mining is to structure natural language text – for obvious reasons: the amount of information we can extract from the data using shallow approaches like bag-of-words is limited. By enhancing text with structure, we can start to observe information that is encoded in more than one word or sentence. Also, structure enables us to bring the additional power of semantic methods and background knowledge into play.

While reasonably reliable methods have been developed for structuring text by annotating and identifying a specific subset of information, mostly named entities, little work has been done on semantically capturing the macro-level aspects of the text. In this article, we present some early work on constructing domain templates, a generic "summary" that fits many pieces of text on a specific theme (e.g. news stories about bombings) at the same time.

The genericness of the template provides for data exploration in two ways:

1. By automatically mapping specific facts and entities in an article to the more general ones in a template, we provide structure to the articles, as entities from potentially many articles get mapped to a single semantic "slot" in a single template.

2. By (possibly statistically) inspecting all the articles that were mapped to a chosen template, we can observe the diversity of articles in a specific aspect, exploiting the fact that they are semantically aligned to an extent. For example, if the template contains the statement happen_at..location, no further processing is required to find the specific locations which the template-mapped articles describe.

To determine entities and relations that subsume those from individual news articles and thus construct a template, we make use of a general-purpose ontology, in our case WordNet. To represent the templates as well as individual news stories, we use semantic graphs, i.e. graphs in which entities are represented as nodes and the (binary) relationships connecting them are represented as edges.

A sample of the patterns we obtain can be seen in Figure 2. For example, analyzing a collection of articles describing various bombing attacks, the pattern in the first line emerges: a person was killed on a weekday; that same person was killed in an attack which took place. The concrete instantiations of person and weekday vary across the articles from which the pattern was derived.

Domain specifics.

Note that media companies have a considerable interest in semantically annotating text, particularly news items. For this reason, and because of the easy availability of datasets, we focus on the domain of newswire in this paper. Despite this, there is in principle nothing in our method that limits its applicability to this domain. In general, the required input data is a collection of text items which are assumed to discuss roughly the same aspects of a single topic. Examples of such collections are "news articles about earthquakes", "Wikipedia articles on football players" or "microwave product reviews".
2. RELATED WORK

Because it aligns articles to a common template, our method has much in common with other information extraction mechanisms. Automatic construction of information extraction templates is already relatively well-researched. Most methods aim for attribute extraction, where the goal is to extract a single predefined type of information, e.g. the title of a book. Each separate type of information requires a separate classifier and training data. Examples of such approaches are [1, 2].

More recently, a generalized problem of relation extraction has received considerable attention. The goal is to find pairs of items related by a predefined relation. As an example, Probst et al. [7] mine product descriptions to simultaneously identify product attributes and their values. Relation extraction is particularly popular in biomedicine, where pairs of proteins in a certain relation (e.g. one inhibits the other) are often of interest.

The task in this article is more generalized still; we attempt both to decide what information is interesting to extract and to perform the extraction. This is known as domain template extraction. To our knowledge, little work has been done in the area so far. The most closely related work is by Filatova et al. [4], who find templates by mining frequent parse subtrees. Also closely related is work by Li et al. [6]; similarly to Filatova, they mine frequent parse subtrees but then cluster them into "aspects" with a novel graphical model. Both approaches produce syntax-level patterns. Unlike ours, neither of the two approaches exploits background knowledge. Also belonging to this group is our previous work [10], which mostly shares the goal and data representation ideas with this article, but uses different methods apart from preprocessing.

Graph-based templates are also used in [9] in a context similar to ours, though the semantics are shallower. Also, the authors focus on information extraction and do not attempt to generalize the templates.

Templates somewhat similar to those we aim to construct automatically and with no knowledge of the domain have already been created manually by domain experts. FrameNet [?] is a collection of templates for events like "disclosing a secret", "speaking", "killing", "arresting" etc. They focus mostly on low-level events, of which typically many can be found in a single document, be it a news article or not. The project does not concern itself with the creation of the templates, other than from the methodological point of view. There is little support for automatic annotation of natural language with FrameNet frames.

3. METHOD OVERVIEW

This section describes the various stages in our data processing pipeline. The assumed input data is, as discussed above, a collection of text items on the same topic. The goal is to identify a pattern which semantically matches a substantial number of the input texts.

The key idea is rather simple: we first represent our input data as semantic graphs, i.e. graphs of ontology-aligned entities and relations. A pattern is then defined as a (smaller) graph such that, by specializing some of its entities, it becomes a subgraph of at least θ input graphs (θ being a parameter). We seek to identify all such patterns.

We proceed to describe our approach to the construction of semantic graphs and to the mining of approximate subgraphs.

3.1 Data Preprocessing

Starting with plain text, we first annotate it with some basic semantic and linguistic information. Using the ANNIE tool from the GATE framework, we first detect named entities and tag them as person, location or organization. Following that, we use the Stanford parser [5] to extract subject-verb-object triplets. We then use the web service by Rusu [8] to perform coreference and pronoun resolution ("Mr. Obama", "President Barack Obama" and "he" might all refer to the same entity within an article).
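The triplet-extraction step can be made concrete with a minimal sketch. This is not the paper's actual extractor: it assumes the parse is already available as (head, relation, dependent) tuples in the Stanford typed-dependency style, and it glosses over everything else the pipeline does (the parsing itself, coreference, prepositional attachments, NE tags):

```python
def extract_triplets(dependencies):
    """Reduce typed-dependency edges to subject-verb-object triplets.

    dependencies -- iterable of (head, relation, dependent) tuples,
    e.g. ("visited", "nsubj", "Obama"); a simplified stand-in for
    the Stanford parser output used in Section 3.1.
    """
    subjects, objects = {}, {}
    for head, rel, dep in dependencies:
        if rel in ("nsubj", "nsubjpass"):  # surface subject, active or passive
            subjects[head] = dep
        elif rel == "dobj":                # direct object
            objects[head] = dep
    # keep only verbs for which both a subject and an object were seen
    return [(subjects[v], v, objects[v])
            for v in subjects.keys() & objects.keys()]

# extract_triplets([("visited", "nsubj", "Obama"),
#                   ("visited", "dobj", "France")])
# -> [("Obama", "visited", "France")]
```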
shortcoming is alleviated by using redundant informa-
quent parse subtrees. Also closely related is work by Li et al.
tion - each story, e.g. president’s visit to China, is
[6]; similarly to Filatova, they mine frequent parse subtrees
described by several articles which increases the proba-
but then cluster them into “aspects” with a novel graphical
bility that at least one will convey this information in a
model. Both approaches produce syntax-level patterns. Un-
form we can detect. However, the problem is not com-
like ours, neither of the two approaches exploits background
pletely overcome this way – some information e.g. the
knowledge. Also belonging to this group is our previous
“93” in “93 people were killed on Monday” will never
work [10] which mostly shares the goal and data representa-
appear as the object of a transitive verb.
tion ideas with this article, but uses different methods apart
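A minimal sketch of this alignment step using NLTK; the on-the-fly synset for out-of-vocabulary words is represented here by a plain tuple, and the `ner_tag` parameter is an illustrative stand-in for the ANNIE tag:

```python
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def align_to_wordnet(word, pos=wn.NOUN, ner_tag=None):
    """Map one triplet element to a WordNet synset (Section 3.1)."""
    lemma = lemmatizer.lemmatize(word.lower(), pos=pos)  # strip inflection
    synsets = wn.synsets(lemma, pos=pos)
    if synsets:
        # NLTK lists synsets by sense frequency, so taking the first
        # one implements the "most common sense" strategy.
        return synsets[0]
    # word not in WordNet: create a stand-in "synset" on the fly,
    # remembering the NE tag (if any) as its intended hypernym
    return ("new-synset", lemma, ner_tag)

# align_to_wordnet("killed", pos=wn.VERB) -> Synset('kill.v.01')
# align_to_wordnet("Obama", ner_tag="person") -> ('new-synset', 'obama', 'person')
```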
3.2 Semantic Graph Construction

From a collection of triplets, we proceed to construct the semantic graph. Here, we rely rather heavily on the fact that news articles tend to be focused in scope: we do not disambiguate entities other than by name (not necessarily a proper name; e.g. "book" is also a name). As an example, if an article mentions two buildings, one of which burns down and the second of which has a green roof, our method detects a single "building" and assigns both properties to it. In the newswire domain, we have not found this to be a significant issue: entities which do need to be disambiguated are presented with more unique names ("France" instead of "country" etc.). This rationale would have to be revised if one wanted to apply the approach to longer texts.

This assumption greatly simplifies the construction of the semantic graph: we start by treating each triplet as a two-node component of a single very fragmented graph and then collapse the nodes with the same labels.
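Under this assumption the construction is little more than keying nodes by label; a sketch using networkx (our illustrative choice here, not necessarily the implementation used in the paper):

```python
import networkx as nx

def build_semantic_graph(triplets):
    """Merge (subject, verb, object) triplets into one semantic graph.

    Nodes are keyed by their label, so the "collapse nodes with the
    same labels" step of Section 3.2 happens implicitly; verbs become
    edge labels, and parallel edges keep repeated relations.
    """
    graph = nx.MultiDiGraph()
    for subject, verb, obj in triplets:
        graph.add_edge(subject, obj, label=verb)
    return graph

g = build_semantic_graph([("people", "kill", "Monday"),
                          ("bomber", "explode", "device")])
# g.nodes -> people, Monday, bomber, device; any further triplet
# mentioning "people" attaches to the same node.
```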
Dataset specifics.

In our experiments, each input "document" in the sense described here was in fact a concatenation of actual documents, all of which were reporting on the exact same news event. Section 4 contains the details and rationale.

3.3 Approximate Pattern Detection

Given a collection of labeled graphs, we now wish to identify frequent "approximate subgraphs", i.e. patterns as described at the beginning of Section 3.

Formal task definition: Given a set of labeled graphs S = {G1, ..., Gn}, a transitive antisymmetric relation on graph labels genl(·, ·) (with genl(A, B) interpreted as "label A is a generalization of label B") and a number θ, we wish to construct all maximal graphs H that are approximate subgraphs of at least θ graphs from S. A graph H is said to be an approximate subgraph of G iff there is a mapping f of V(H) onto a subset of V(G) such that genl(v, f(v)) holds for all v ∈ V(H).
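A brute-force rendering of this definition, purely to make it concrete (it is exponential in |V(H)|; genl would be, e.g., the transitive closure of WordNet hypernymy):

```python
from itertools import permutations

def is_approx_subgraph(H, G, genl):
    """Check the definition above: H = (nodes, edges) is an approximate
    subgraph of G iff some injective mapping f of H's nodes into G's
    satisfies genl(v, f(v)) everywhere and preserves H's edges.
    Edge labels, omitted here, would be compared analogously.
    """
    h_nodes, h_edges = H
    g_nodes, g_edges = G
    for image in permutations(g_nodes, len(h_nodes)):
        f = dict(zip(h_nodes, image))
        if (all(genl(v, f[v]) for v in h_nodes)
                and all((f[u], f[v]) in g_edges for u, v in h_edges)):
            return True
    return False

# The support of a pattern H is then
#   sum(is_approx_subgraph(H, G, genl) for G in S)
# and H is kept iff this count reaches theta.
```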
This is not an easy task. Mining frequent subgraphs is in itself computationally demanding because of isomorphisms; satisfactorily fast algorithms for this seemingly basic problem are relatively recent [11]. By further requiring that the frequent subgraph only match the input graphs in a soft way implied by a taxonomy (here WordNet hypernymy), the complexity becomes insurmountable. We compensate by introducing two assumptions.

1. The hierarchy imposed by genl has a tree-like form; it is not a general DAG. This is true of WordNet: every synset has at most one hypernym defined.

2. Very generic patterns are not interesting and can (or even should) be skipped. This too is a safe assumption in our scenario: a pattern in which every node is labeled with the most generic label entity is hardly informative regardless of its graph structure.

We can now employ a simple but effective three-stage search. The stages are illustrated in Figure 1 with the minimal example of two two-node graphs.

1) assassin -blow_up→ president     robber -murder→ officer
2) person -kill→ person             person -kill→ person
3) criminal -kill→ person

Figure 1: Generalization of input graphs and re-specialization of the pattern.

1. Generalize all the labels of the input graphs to the maximum extent permissible. Under the first assumption, "generalizing a label" is a well-defined operation. The exact meaning of "maximum extent permissible" is governed by the second assumption; no label should be generalized so much as to fall into the uninteresting category. In our experience with WordNet, the following simple rule worked very well: generalize verbs as much as possible and generalize nouns to two levels below the hierarchy root. See steps 1 to 2 in Fig. 1 and the sketch after this list.

2. Mine θ-frequent maximal subgraphs with support counted over the generalized input graphs. This step cannot be shown in Fig. 1 as the graphs are too small.

3. Formally, the resulting subgraphs already satisfy our demands. However, to make them as descriptive as possible, we try to specialize the pattern's labels, taking care not to perform a specialization that would reduce the pattern's support below θ. Specialization, unlike generalization, is not a uniquely defined operation (a synset can have many hyponyms), but with some care we can afford to recursively explore the whole space of possible specializations. We use the sum of the labels' depths in the WordNet hierarchy as the measure of pattern descriptiveness that we optimize. See steps 2 to 3 in Fig. 1 and the sketch after this list.
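The two label-manipulation steps can be sketched directly on top of NLTK's WordNet interface. This is a simplified reading of steps 1 and 3: a single hypernym path per synset is assumed (per assumption 1 above), and `support()` is an assumed callback that counts matched input graphs, e.g. via the brute-force check shown earlier:

```python
from nltk.corpus import wordnet as wn

def generalize(synset):
    """Step 1: lift a label as high as is still interesting --
    verbs as far as possible, nouns to two levels below the root."""
    path = synset.hypernym_paths()[0]   # root ... synset, one path assumed
    if synset.pos() == wn.VERB:
        return path[0]                  # topmost verb ancestor
    return path[2] if len(path) > 2 else synset

def specialize(labels, support, theta):
    """Step 3: recursively push labels down the hyponym tree while the
    pattern's support stays >= theta, maximizing the sum of label
    depths (the descriptiveness measure from Section 3.3).

    labels  -- tuple of synsets labelling the pattern's nodes
    support -- support(labels) -> number of input graphs matched
    """
    best, best_score = labels, sum(s.min_depth() for s in labels)
    for i, synset in enumerate(labels):
        for hyponym in synset.hyponyms():
            candidate = labels[:i] + (hyponym,) + labels[i + 1:]
            if support(candidate) >= theta:   # specialization still frequent
                deeper, score = specialize(candidate, support, theta)
                if score > best_score:
                    best, best_score = deeper, score
    return best, best_score
```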
For frequent subgraph mining, we developed our own algorithm, inspired by the current state of the art [11, 3]. We included some improvements pertaining to the maximality of output graphs and to scalability – all existing open-source software crashed on our full input data.

4. PRELIMINARY EXPERIMENTS AND RESULTS

As a preliminary, let us define some terminology suitable for our experiment domain. An article is a single web page which is assumed to report on a single story. A story is an event that is covered by one or more articles. Each story may fit some domain template (also event template or simply template) describing a certain type of event.

We obtained a month's worth of articles from Google News by crawling. Each article was cleaned of all HTML markup, advertisements, navigation and similar. Articles were grouped into stories according to Google News.

For each story, a semantic graph was constructed. The reason to use an aggregate story graph rather than individual article graphs was twofold. First, by representing each story as a single graph, all stories were represented equivalently (as opposed to the case where each article contributed a graph, resulting in stories being weighted proportionally to the number of their articles). Second, the method for extracting triplets has relatively low precision and recall; it therefore makes sense to exploit the redundancy inherent in a collection of articles reporting on the same event. To construct the aggregate story graph, we simply concatenated the plain text of the individual articles; aggregation at this early stage has the added benefit of providing cross-article entity resolution. Finally, the collection of semantic graphs from stories on a single topic was input to the pattern mining algorithm.

We defined five topics on which to observe the behavior of the method: bomb attacks, award ceremonies, worker layoffs, political visits and court sentencings. For each, we identified about 10 stories of interest. Note that each story further comprises about 100 articles, clustering courtesy of Google News; in total, about 5000 articles were therefore processed.

As semantic graphs were constructed on the level of stories rather than articles, their structure was relatively rich. They had about 1000 nodes each and an average node degree of roughly 2.5. The 20% most connected nodes, which are also the ones likely to appear in the patterns, had an average degree of about 20.

For each topic, graphs of all its stories were input to the algorithm. The minimal pattern support was set at 30% for all the topics. The algorithm output several patterns for each topic; the sizes of the outputs along with the interesting patterns are presented in Figure 2. For instance, the last pattern in the "visits" domain shows that in at least 30% of the stories, there was a male person ("he", e.g. Obama) who traveled to France (a coincidence), and that same person met a "leader" (a president in some of the stories, a minister in others).

Bombing attacks (8 patterns in total)
  weekday ←kill- person -kill→ attack -take→ place
  himself ←have- suicide bomber -explode→ device
  himself ←have- suicide bomber -blow→ building

Court sentencings (7 patterns in total)
  correctional institution ←be- person -face→ year ←be- sentence
  innocent ←be- person -face→ year ←be- sentence

Award ceremonies (2 patterns in total)
  period of time ←have- person -be→ feeling

Political visits (3 patterns in total)
  summit ←attend- he -hold→ talk
             |`-be→ leader
             |`-tell→ communicator
             `--express→ feeling
  need ←stress- he -hold→ talk
             |`-attend→ summit
             `--be→ leader
  leader ←meet- he -travel→ France

Worker layoffs (0 patterns in total)

Figure 2: Manually selected best patterns for each domain.
5. DISCUSSION AND FUTURE WORK

The preliminary results seem sound. The mappings of individual stories onto the patterns (not given here) also provide a semantically correct alignment. We can observe how each story fits the template with slightly different entities. Sometimes, the variations are almost imperceptible – "correctional facility" from the "court" domain, for example, appears as either "jail" or "prison", which for some reason are two distinct concepts in WordNet.

In other cases, the distinctions are significant and express the subtopical diversity we were looking for. For example, the groundings for "leader" in the "visits" domain varied even in our small dataset over president, minister, instigator or simply leader. In the same domain, "feeling" was either sorrow, disappointment or satisfaction. The "building" in the "bombings" domain was generalized from mosque, restaurant, hotel and building. It might be interesting to investigate this further and use the amount of variation between pattern groundings as a measure of pattern interestingness.

Unexpectedly, diversity can occasionally be found in the natural clustering that the patterns provide. Observe the two patterns in the "court" domain: in both, the defendant is facing a sentence of (one or more) years, but is found innocent in one cluster and sent to jail in the other.

While the current experiments are too small to provide conclusive evidence, we can make some speculations about precision and recall. While the former is low but usable (a data analyst should not mind going through e.g. 5 patterns to identify a useful one), the latter seems a bigger issue. We hope to improve the results significantly by developing a better triplet extractor, though this is a new project in itself; the previously discussed deficiencies of the current triplets appear to hurt performance most.

The tests also indicate that the method is not equally suitable for all domains. The "layoffs" domain, for example, had no single pattern which would occur in 30% of the stories. (A threshold of 25% produces a single, rather nonsensical pattern "it -cut→ job ←lose- people".) The "awards" domain does not fare much better. Most probably, these two topics are too broad, causing stories to have only little overlap.

Note that in the current implementation, all final patterns with fewer than three nodes (e.g. worker..lose..job for the "layoffs" topic) were discarded. This is partly because we are, in perspective, interested in (dis)proving that structured patterns can provide more information than the sentence-level patterns found in related work (the "visits" domain is a nice indication that this may be true). Partly, however, it is also because including two-node patterns would introduce additional noise in the output. Even now, the precision is relatively low; it would therefore be interesting to investigate measures of pattern interestingness other than raw frequency.

6. ACKNOWLEDGMENTS

This work was supported by the Slovenian Research Agency and the IST Programme of the EC under PASCAL2 (IST-NoE-216886), ACTIVE (IST-2008-215040) and RENDER (FP7-257790).

7. REFERENCES

[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. Pages 337–348, 2003.
[2] S. Brin. Extracting patterns and relations from the world wide web. Lecture Notes in Computer Science, pages 172–183, 1999.
[3] Y. Chi, S. Nijssen, R. Muntz, and J. Kok. Frequent subtree mining – an overview. Fundamenta Informaticae, 66(1):161–198, 2005.
[4] E. Filatova, V. Hatzivassiloglou, and K. McKeown. Automatic creation of domain templates. In Proceedings of COLING/ACL 2006, pages 207–214, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[5] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics, 2003.
[6] P. Li, J. Jiang, and Y. Wang. Generating templates of entity summaries with an entity-aspect model and pattern mining. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 640–649, 2010.
[7] K. Probst, R. Ghani, M. Krema, A. Fano, and Y. Liu. Semi-supervised learning of attribute-value pairs from product descriptions. Pages 2838–2843, 2007.
[8] D. Rusu, B. Fortuna, M. Grobelnik, and D. Mladenić. Semantic graphs derived from triplets with application in document summarization. Informatica Journal, 2009.
[9] H. Tanev and B. Magnini. Weakly supervised approaches for ontology population. 2006.
[10] M. Trampuš and D. Mladenić. Learning event templates from news articles. In Proceedings of SiKDD 2009, 2009.
[11] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. Page 721, 2002.