=Paper=
{{Paper
|id=None
|storemode=property
|title=Approximate subgraph matching for detection of topic variations
|pdfUrl=https://ceur-ws.org/Vol-762/paper5.pdf
|volume=Vol-762
}}
==Approximate subgraph matching for detection of topic variations==
Mitja Trampuš (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia, mitja.trampus@ijs.si)
Dunja Mladenić (Jozef Stefan Institute, Jamova 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si)
ABSTRACT

The paper presents an approach to detection of topic variations based on approximate graph matching. Text items are represented as semantic graphs and approximately matched based on a taxonomy of node and edge labels. Best-matching subgraphs are used as a template against which to align and compare the articles. The proposed approach is applied to news stories using WordNet as the predefined taxonomy. Illustrative experiments on real-world data show that the approach is promising.

Categories and Subject Descriptors

I.2.8 [Artificial intelligence]: Search—Graph and tree search strategies; I.5.4 [Computing methodologies]: Applications—Text processing

1. INTRODUCTION

One of the classic goals of text mining is to structure natural language text – for obvious reasons: the amount of information we can extract from the data using shallow approaches like bag-of-words is limited. By enhancing text with structure, we can start to observe information that is encoded in more than one word or sentence. Also, structure enables us to bring the additional power of semantic methods and background knowledge into play.

While reasonably reliable methods have been developed for structuring text by annotating and identifying a specific subset of information, mostly named entities, little work has been done on semantically capturing the macro-level aspects of the text. In this article, we present some early work on constructing domain templates, a generic "summary" that fits many pieces of text on a specific theme (e.g. news stories about bombings) at the same time.

The genericness of the template provides for data exploration in two ways:

1. By automatically mapping specific facts and entities in an article to the more general ones in a template, we provide structure to the articles, as entities from potentially many articles get mapped to a single semantic "slot" in a single template.

2. By (possibly statistically) inspecting all the articles that were mapped to a chosen template, we can observe the diversity of articles in a specific aspect, exploiting the fact that they are semantically aligned to an extent. For example, if the template contains the statement happen_at..location, no further processing is required to find the specific locations which the template-mapped articles describe.

To determine entities and relations that subsume those from individual news articles and thus construct a template, we make use of a general-purpose ontology, in our case WordNet. To represent the templates as well as individual news stories, we use semantic graphs, i.e. graphs in which entities are represented as nodes and the (binary) relationships connecting them are represented as edges.

A sample of the patterns we obtain can be seen in Figure 2. For example, analyzing a collection of articles describing various bombing attacks, the pattern in the first line emerges: a person was killed on a weekday; that same person was killed in an attack which took place. The concrete instantiations of person and weekday vary across the articles from which the pattern was derived.

Domain specifics.

Note that media companies have a considerable interest in semantically annotating text, particularly news items. For this reason, and because of the easy availability of datasets, we focus on the domain of newswire in this paper. Despite this, there is in principle nothing in our method that limits its applicability to this domain. In general, the required input data is a collection of text items which are assumed to discuss roughly the same aspects of a single topic. Examples of such collections are "news articles about earthquakes", "Wikipedia articles on football players" or "microwave product reviews".
2. RELATED WORK

Because it aligns articles to a common template, our method has much in common with other information extraction mechanisms. Automatic construction of information extraction templates is already relatively well-researched. Most methods aim for attribute extraction, where the goal is to extract a single predefined type of information, e.g. the title of a book. Each separate type of information requires a separate classifier and training data. Examples of such approaches are [1, 2].

More recently, a generalized problem of relation extraction has received considerable attention. The goal is to find pairs of items related by a predefined relation. As an example, Probst et al. [7] mine product descriptions to simultaneously identify product attributes and their values. Relation extraction is particularly popular in biomedicine, where pairs of proteins in a certain relation (e.g. one inhibits the other) are often of interest.

The task in this article is more generalized still; we attempt both to decide what information is interesting to extract and to perform the extraction. This is known as domain template extraction. To our knowledge, little work has been done in the area so far. The most closely related work is by Filatova et al. [4], who find templates by mining frequent parse subtrees. Also closely related is work by Li et al. [6]; similarly to Filatova, they mine frequent parse subtrees but then cluster them into "aspects" with a novel graphical model. Both approaches produce syntax-level patterns. Unlike ours, neither of the two approaches exploits background knowledge. Also belonging to this group is our previous work [10], which mostly shares the goal and data representation ideas with this article, but uses different methods apart from preprocessing.

Graph-based templates are also used in [9] in a context similar to ours, though the semantics are shallower. Also, the authors focus on information extraction and do not attempt to generalize the templates.

Templates somewhat similar to those we aim to construct automatically and with no knowledge of the domain have already been created manually by domain experts. FrameNet [?] is a collection of templates for events like "disclosing a secret", "speaking", "killing", "arresting" etc. They focus mostly on low-level events, of which typically many can be found in a single document, be it a news article or not. The project does not concern itself with the creation of the templates, other than from the methodological point of view. There is little support for automatic annotation of natural language with FrameNet frames.

3. METHOD OVERVIEW

This section describes the various stages in our data processing pipeline. The assumed input data is, as discussed above, a collection of text items on the same topic. The goal is to identify a pattern which semantically matches a substantial number of the input texts.

The key idea is rather simple: we first represent our input data as semantic graphs, i.e. graphs of ontology-aligned entities and relations. A pattern is then defined as a (smaller) graph such that, by specializing some of its entities, it becomes a subgraph of at least θ input graphs (θ being a parameter). We seek to identify all such patterns.

We proceed to describe our approach to the construction of semantic graphs and to the mining of approximate subgraphs.

3.1 Data Preprocessing

Starting with plain text, we first annotate it with some basic semantic and linguistic information. Using the ANNIE tool from the GATE framework, we first detect named entities and tag them as person, location or organization. Following that, we use the Stanford parser [5] to extract subject-verb-object triplets. We then use the web service by Rusu [8] to perform coreference and pronoun resolution ("Mr. Obama", "President Barack Obama" and "he" might all refer to the same entity within an article).
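The triplet-extraction step can be made concrete with a minimal sketch. This is not the paper's actual extractor: it assumes the parse is already available as (head, relation, dependent) tuples in the Stanford typed-dependency style, and it glosses over everything else the pipeline does (the parsing itself, coreference, prepositional attachments, NE tags):

```python
def extract_triplets(dependencies):
    """Reduce typed-dependency edges to subject-verb-object triplets.

    dependencies -- iterable of (head, relation, dependent) tuples,
    e.g. ("visited", "nsubj", "Obama"); a simplified stand-in for
    the Stanford parser output used in Section 3.1.
    """
    subjects, objects = {}, {}
    for head, rel, dep in dependencies:
        if rel in ("nsubj", "nsubjpass"):  # surface subject, active or passive
            subjects[head] = dep
        elif rel == "dobj":                # direct object
            objects[head] = dep
    # keep only verbs for which both a subject and an object were seen
    return [(subjects[v], v, objects[v])
            for v in subjects.keys() & objects.keys()]

# extract_triplets([("visited", "nsubj", "Obama"),
#                   ("visited", "dobj", "France")])
# -> [("Obama", "visited", "France")]
```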
shortcoming is alleviated by using redundant informa-
quent parse subtrees. Also closely related is work by Li et al.
tion - each story, e.g. president’s visit to China, is
[6]; similarly to Filatova, they mine frequent parse subtrees
described by several articles which increases the proba-
but then cluster them into “aspects” with a novel graphical
bility that at least one will convey this information in a
model. Both approaches produce syntax-level patterns. Un-
form we can detect. However, the problem is not com-
like ours, neither of the two approaches exploits background
pletely overcome this way – some information e.g. the
knowledge. Also belonging to this group is our previous
“93” in “93 people were killed on Monday” will never
work [10] which mostly shares the goal and data representa-
appear as the object of a transitive verb.
tion ideas with this article, but uses different methods apart
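A minimal sketch of this alignment step using NLTK; the on-the-fly synset for out-of-vocabulary words is represented here by a plain tuple, and the `ner_tag` parameter is an illustrative stand-in for the ANNIE tag:

```python
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def align_to_wordnet(word, pos=wn.NOUN, ner_tag=None):
    """Map one triplet element to a WordNet synset (Section 3.1)."""
    lemma = lemmatizer.lemmatize(word.lower(), pos=pos)  # strip inflection
    synsets = wn.synsets(lemma, pos=pos)
    if synsets:
        # NLTK lists synsets by sense frequency, so taking the first
        # one implements the "most common sense" strategy.
        return synsets[0]
    # word not in WordNet: create a stand-in "synset" on the fly,
    # remembering the NE tag (if any) as its intended hypernym
    return ("new-synset", lemma, ner_tag)

# align_to_wordnet("killed", pos=wn.VERB) -> Synset('kill.v.01')
# align_to_wordnet("Obama", ner_tag="person") -> ('new-synset', 'obama', 'person')
```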
3.2 Semantic Graph Construction

From a collection of triplets, we proceed to construct the semantic graph. Here, we rely rather heavily on the fact that news articles tend to be focused in scope: we do not disambiguate entities other than by name (not necessarily a proper name; e.g. "book" is also a name). As an example, if an article mentions two buildings, one of which burns down and the second of which has a green roof, our method detects a single "building" and assigns both properties to it. In the newswire domain, we have not found this to be a significant issue: entities which do need to be disambiguated are presented with more unique names ("France" instead of "country" etc.). This rationale would have to be revised if one wanted to apply the approach to longer texts.

This assumption greatly simplifies the construction of the semantic graph: we start by treating each triplet as a two-node component of a single very fragmented graph and then collapse the nodes with the same labels.
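Under this assumption the construction is little more than keying nodes by label; a sketch using networkx (our illustrative choice here, not necessarily the implementation used in the paper):

```python
import networkx as nx

def build_semantic_graph(triplets):
    """Merge (subject, verb, object) triplets into one semantic graph.

    Nodes are keyed by their label, so the "collapse nodes with the
    same labels" step of Section 3.2 happens implicitly; verbs become
    edge labels, and parallel edges keep repeated relations.
    """
    graph = nx.MultiDiGraph()
    for subject, verb, obj in triplets:
        graph.add_edge(subject, obj, label=verb)
    return graph

g = build_semantic_graph([("people", "kill", "Monday"),
                          ("bomber", "explode", "device")])
# g.nodes -> people, Monday, bomber, device; any further triplet
# mentioning "people" attaches to the same node.
```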
Dataset specifics.

In our experiments, each input "document" in the sense described here was in fact a concatenation of actual documents, all of which were reporting on the exact same news event. Section 4 contains the details and rationale.

3.3 Approximate Pattern Detection

Given a collection of labeled graphs, we now wish to identify frequent "approximate subgraphs", i.e. patterns as described at the beginning of Section 3.

Formal task definition: Given a set of labeled graphs S = {G1, ..., Gn}, a transitive antisymmetric relation on graph labels genl(·, ·) (with genl(A, B) interpreted as "label A is a generalization of label B") and a number θ, we wish to construct all maximal graphs H that are approximate subgraphs of at least θ graphs from S. A graph H is said to be an approximate subgraph of G iff there is a mapping f of V(H) onto a subset of V(G) such that genl(v, f(v)) holds for all v ∈ V(H).
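A brute-force rendering of this definition, purely to make it concrete (it is exponential in |V(H)|; genl would be, e.g., the transitive closure of WordNet hypernymy):

```python
from itertools import permutations

def is_approx_subgraph(H, G, genl):
    """Check the definition above: H = (nodes, edges) is an approximate
    subgraph of G iff some injective mapping f of H's nodes into G's
    satisfies genl(v, f(v)) everywhere and preserves H's edges.
    Edge labels, omitted here, would be compared analogously.
    """
    h_nodes, h_edges = H
    g_nodes, g_edges = G
    for image in permutations(g_nodes, len(h_nodes)):
        f = dict(zip(h_nodes, image))
        if (all(genl(v, f[v]) for v in h_nodes)
                and all((f[u], f[v]) in g_edges for u, v in h_edges)):
            return True
    return False

# The support of a pattern H is then
#   sum(is_approx_subgraph(H, G, genl) for G in S)
# and H is kept iff this count reaches theta.
```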
This is not an easy task. Mining frequent subgraphs is in itself computationally demanding because of isomorphisms; satisfactorily fast algorithms for this seemingly basic problem are relatively recent [11]. By further requiring that the frequent subgraph only match the input graphs in a soft way implied by a taxonomy (here WordNet hypernymy), the complexity becomes insurmountable. We compensate by introducing two assumptions.

1. The hierarchy imposed by genl has a tree-like form; it is not a general DAG. This is true of WordNet: every synset has at most one hypernym defined.

2. Very generic patterns are not interesting and can (or even should) be skipped. This too is a safe assumption in our scenario: a pattern in which every node is labeled with the most generic label entity is hardly informative regardless of its graph structure.

We can now employ a simple but effective three-stage search. The stages are illustrated in Figure 1 with the minimal example of two two-node graphs.

1) assassin -blow_up→ president     robber -murder→ officer
2) person -kill→ person             person -kill→ person
3) criminal -kill→ person

Figure 1: Generalization of input graphs and re-specialization of the pattern.

1. Generalize all the labels of the input graphs to the maximum extent permissible. Under the first assumption, "generalizing a label" is a well-defined operation. The exact meaning of "maximum extent permissible" is governed by the second assumption; no label should be generalized so much as to fall into the uninteresting category. In our experience with WordNet, the following simple rule worked very well: generalize verbs as much as possible and generalize nouns to two levels below the hierarchy root. See steps 1 to 2 in Fig. 1 and the sketch after this list.

2. Mine θ-frequent maximal subgraphs with support counted over the generalized input graphs. This step cannot be shown in Fig. 1 as the graphs are too small.

3. Formally, the resulting subgraphs already satisfy our demands. However, to make them as descriptive as possible, we try to specialize the pattern's labels, taking care not to perform a specialization that would reduce the pattern's support below θ. Specialization, unlike generalization, is not a uniquely defined operation (a synset can have many hyponyms), but with some care we can afford to recursively explore the whole space of possible specializations. We use the sum of the labels' depths in the WordNet hierarchy as the measure of pattern descriptiveness that we optimize. See steps 2 to 3 in Fig. 1 and the sketch after this list.
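The two label-manipulation steps can be sketched directly on top of NLTK's WordNet interface. This is a simplified reading of steps 1 and 3: a single hypernym path per synset is assumed (per assumption 1 above), and `support()` is an assumed callback that counts matched input graphs, e.g. via the brute-force check shown earlier:

```python
from nltk.corpus import wordnet as wn

def generalize(synset):
    """Step 1: lift a label as high as is still interesting --
    verbs as far as possible, nouns to two levels below the root."""
    path = synset.hypernym_paths()[0]   # root ... synset, one path assumed
    if synset.pos() == wn.VERB:
        return path[0]                  # topmost verb ancestor
    return path[2] if len(path) > 2 else synset

def specialize(labels, support, theta):
    """Step 3: recursively push labels down the hyponym tree while the
    pattern's support stays >= theta, maximizing the sum of label
    depths (the descriptiveness measure from Section 3.3).

    labels  -- tuple of synsets labelling the pattern's nodes
    support -- support(labels) -> number of input graphs matched
    """
    best, best_score = labels, sum(s.min_depth() for s in labels)
    for i, synset in enumerate(labels):
        for hyponym in synset.hyponyms():
            candidate = labels[:i] + (hyponym,) + labels[i + 1:]
            if support(candidate) >= theta:   # specialization still frequent
                deeper, score = specialize(candidate, support, theta)
                if score > best_score:
                    best, best_score = deeper, score
    return best, best_score
```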
For frequent subgraph mining, we developed our own algorithm, inspired by the current state of the art [11, 3]. We included some improvements pertaining to the maximality of output graphs and to scalability – all existing open-source software crashed on our full input data.

4. PRELIMINARY EXPERIMENTS AND RESULTS

As a preliminary, let us define some terminology suitable for our experiment domain. An article is a single web page which is assumed to report on a single story. A story is an event that is covered by one or more articles. Each story may fit some domain template (also event template or simply template) describing a certain type of event.

We obtained a month's worth of articles from Google News by crawling. Each article was cleaned of all HTML markup, advertisements, navigation and similar. Articles were grouped into stories according to Google News.

For each story, a semantic graph was constructed. The reason to use an aggregate story graph rather than individual article graphs was twofold. First, by representing each story as a single graph, all stories were represented equivalently (as opposed to the case where each article contributed a graph, resulting in stories being weighted proportionally to the number of their articles). Second, the method for extracting triplets has relatively low precision and recall; it therefore makes sense to exploit the redundancy inherent in a collection of articles reporting on the same event. To construct the aggregate story graph, we simply concatenated the plain text of the individual articles; aggregation at this early stage has the added benefit of providing cross-article entity resolution. Finally, the collection of semantic graphs from stories on a single topic was input to the pattern mining algorithm.

We defined five topics on which to observe the behavior of the method: bomb attacks, award ceremonies, worker layoffs, political visits and court sentencings. For each, we identified about 10 stories of interest. Note that each story further comprises about 100 articles, clustering courtesy of Google News; in total, about 5000 articles were therefore processed.

As semantic graphs were constructed on the level of stories rather than articles, their structure was relatively rich. They had about 1000 nodes each and an average node degree of roughly 2.5. The 20% most connected nodes, which are also the ones likely to appear in the patterns, had an average degree of about 20.

For each topic, graphs of all its stories were input to the algorithm. The minimal pattern support was set at 30% for all the topics. The algorithm output several patterns for each topic; the sizes of the outputs along with the interesting patterns are presented in Figure 2. For instance, the last pattern in the "visits" domain shows that in at least 30% of the stories, there was a male person ("he", e.g. Obama) who traveled to France (a coincidence), and that same person met a "leader" (a president in some of the stories, a minister in others).

Bombing attacks (8 patterns in total)
  weekday ←kill- person -kill→ attack -take→ place
  himself ←have- suicide bomber -explode→ device
  himself ←have- suicide bomber -blow→ building

Court sentencings (7 patterns in total)
  correctional institution ←be- person -face→ year ←be- sentence
  innocent ←be- person -face→ year ←be- sentence

Award ceremonies (2 patterns in total)
  period of time ←have- person -be→ feeling

Political visits (3 patterns in total)
  summit ←attend- he -hold→ talk
             |`-be→ leader
             |`-tell→ communicator
             `--express→ feeling
  need ←stress- he -hold→ talk
             |`-attend→ summit
             `--be→ leader
  leader ←meet- he -travel→ France

Worker layoffs (0 patterns in total)

Figure 2: Manually selected best patterns for each domain.
5. DISCUSSION AND FUTURE WORK

The preliminary results seem sound. The mappings of individual stories onto the patterns (not given here) also provide a semantically correct alignment. We can observe how each story fits the template with slightly different entities. Sometimes, the variations are almost imperceptible – "correctional facility" from the "court" domain, for example, appears as either "jail" or "prison", which for some reason are two distinct concepts in WordNet.

In other cases, the distinctions are significant and express the subtopical diversity we were looking for. For example, the groundings for "leader" in the "visits" domain varied even in our small dataset over president, minister, instigator or simply leader. In the same domain, "feeling" was either sorrow, disappointment or satisfaction. The "building" in the "bombings" domain was generalized from mosque, restaurant, hotel and building. It might be interesting to investigate this further and use the amount of variation between pattern groundings as a measure of pattern interestingness.

Unexpectedly, diversity can occasionally be found in the natural clustering that the patterns provide. Observe the two patterns in the "court" domain: in both, the defendant is facing a sentence of (one or more) years, but is found innocent in one cluster and sent to jail in the other.

While the current experiments are too small to provide conclusive evidence, we can make some speculations about precision and recall. While the former is low but usable (a data analyst should not mind going through e.g. 5 patterns to identify a useful one), the latter seems a bigger issue. We hope to improve the results significantly by developing a better triplet extractor, though this is a new project in itself; the previously discussed deficiencies of the current triplets appear to hurt performance most.

The tests also indicate that the method is not equally suitable for all domains. The "layoffs" domain, for example, had no single pattern which would occur in 30% of the stories. (A threshold of 25% produces a single, rather nonsensical pattern "it -cut→ job ←lose- people".) The "awards" domain does not fare much better. Most probably, these two topics are too broad, causing stories to have only little overlap.

Note that in the current implementation, all final patterns with fewer than three nodes (e.g. worker..lose..job for the "layoffs" topic) were discarded. This is partly because we are, in perspective, interested in (dis)proving that structured patterns can provide more information than the sentence-level patterns found in related work (the "visits" domain is a nice indication that this may be true). Partly, however, it is also because including two-node patterns would introduce additional noise in the output. Even now, the precision is relatively low; it would therefore be interesting to investigate measures of pattern interestingness other than raw frequency.

6. ACKNOWLEDGMENTS

This work was supported by the Slovenian Research Agency and the IST Programme of the EC under PASCAL2 (IST-NoE-216886), ACTIVE (IST-2008-215040) and RENDER (FP7-257790).

7. REFERENCES

[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. Pages 337–348, 2003.
[2] S. Brin. Extracting patterns and relations from the world wide web. Lecture Notes in Computer Science, pages 172–183, 1999.
[3] Y. Chi, S. Nijssen, R. Muntz, and J. Kok. Frequent subtree mining – an overview. Fundamenta Informaticae, 66(1):161–198, 2005.
[4] E. Filatova, V. Hatzivassiloglou, and K. McKeown. Automatic creation of domain templates. In Proceedings of COLING/ACL 2006, pages 207–214, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[5] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430. Association for Computational Linguistics, 2003.
[6] P. Li, J. Jiang, and Y. Wang. Generating templates of entity summaries with an entity-aspect model and pattern mining. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 640–649, 2010.
[7] K. Probst, R. Ghani, M. Krema, A. Fano, and Y. Liu. Semi-supervised learning of attribute-value pairs from product descriptions. Pages 2838–2843, 2007.
[8] D. Rusu, B. Fortuna, M. Grobelnik, and D. Mladenić. Semantic graphs derived from triplets with application in document summarization. Informatica Journal, 2009.
[9] H. Tanev and B. Magnini. Weakly supervised approaches for ontology population. 2006.
[10] M. Trampuš and D. Mladenić. Learning event templates from news articles. In Proceedings of SiKDD 2009, 2009.
[11] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. Page 721, 2002.