Extracting Semantic Relations
                   for Mining of Social Data

           Shinichi Nagano^1, Masumi Inaba^2, and Takahiro Kawamura^1
           ^1 Corporate R&D Center, Toshiba Corporation, Japan
           ^2 Platform Solution Business Division, Toshiba Solution Corporation, Japan


        Abstract. This paper proposes a novel method that extracts semantic
        relations from social data in order to acquire ontologies that are used
        for mining social data. A set of nouns is iteratively extracted from
        documents in a bootstrapping manner, and then a semantic relation
        between a noun pair is identified by a clustering procedure. The main
        feature is exploitation of the co-occurrence of a verb and a noun in a
        sentence, considering that a verb plays an important role in expressing
        the meaning of a sentence. The paper presents a preliminary study to
        clarify problems in order to achieve practical performance.


1   Introduction
The Social Web is a communication infrastructure allowing people to argue,
collaborate, or cooperate with other individuals or communities. Numerous opinions
and experiences are mentioned on the Social Web. Thus, enterprises are eager
to utilize social data for their advertising, marketing, or product planning[10,
17]. Mining social data involves digging up fragments of information scattered
on the Web, and then discovering knowledge[15, 16]. Since semantic technologies
have yet to catch up with the explosive growth in the publishing of data on the
Social Web, mining the social data is still a challenging issue.
    It is well known that ontologies are fundamental resources for semantic tech-
nologies, and their development from social data is being pursued around the
world. Among the projects associated with this work are YAGO[1] and DBpedia[3].
They are often provided in a machine readable, reusable, and extensible form,
which makes it possible to develop an ontology for a certain purpose without
building up from scratch. However, local ontologies dependent on their particular
domains are often desired because generic ontologies are insuﬃcient for certain
purposes.
    This paper addresses the issue of semantic relation extraction from doc-
uments on the Social Web. Most of the conventional methods are based on
discovery of lexico-syntactic patterns that are dependency paths indicating a
semantic relation between nouns[5]. The applicability of such methods is limited
to term extraction from sentences matching the patterns, and thus the methods
need to explore a large number of documents.
    The paper proposes a novel method that extracts terms in weak semantic
relations from documents in a predetermined domain, yielding a hierarchical
form of the terms. A weak semantic relation means that terms in the relation
might not exist on dependency paths. The main feature is exploitation of the
co-occurrence of a verb and a noun in a sentence, considering that a verb plays
an important role in expressing the meaning of a sentence. For instance, people
often mention, on the Social Web, their daily experiences and opinions, which
are characterized by verb-noun pairs such as where to go, what to eat, and what
to watch. Prompted by this observation, the method identiﬁes nouns appear-
ing together with particular verbs at a high frequency. This is a hybrid method
involving a bootstrapping procedure and a clustering procedure. The bootstrap-
ping procedure is a process that initially extracts patterns from given verb-noun
pairs and then alternately extracts pairs and patterns. Iterative application of
the procedure yields a set of nouns. The clustering procedure identiﬁes weak se-
mantic relations for each pair of nouns obtained by the bootstrapping procedure,
resulting in the hierarchical form of the nouns.
    The remainder of this paper is organized as follows. Related work is discussed
in Section 2. Next, Section 3 illustrates the proposed method. Then, Section 4
presents a preliminary evaluation of the method. Finally, Section 5 gives the
conclusion and refers to future work.


2   Related Work

Much previous work on automatic extraction of terms in semantic relations has
been based on a key insight of Hearst[4], who noticed that the presence of certain
lexico-syntactic patterns may indicate a particular semantic relation between
two nouns. For instance, linking two noun phrases (NPs) via the construction
“such NP_Y as NP_X” often implies that NP_X is a hyponym of NP_Y. Initially,
a small number of handcrafted patterns like these were used to try to auto-
matically label such semantic relations[7, 8, 11–14]. Following this approach, su-
pervised learning algorithms were devised to obtain a large number of useful
lexico-syntactic patterns. Snow et al.[5] proposed a generic method that formal-
izes lexico-syntactic patterns with dependency paths as features for prediction
of hypernyms. Suchanek et al.[1, 2] applied a supervised learning algorithm to
fact extraction from Wikipedia yielding a common ontology as social semantic
knowledge.
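    As a rough illustration of the handcrafted-pattern approach, the following
Python sketch matches the single construction “such NP_Y as NP_X” with a regular
expression. It assumes that each noun phrase is one alphabetic word, which real
systems do not; the function name such_as_pairs is hypothetical and is not taken
from any of the cited systems.

```python
import re

# One handcrafted pattern: "such <hypernym> as <hyponym>".
# Assumption: each noun phrase is a single alphabetic word.
SUCH_AS = re.compile(r"\bsuch\s+([A-Za-z]+)\s+as\s+([A-Za-z]+)", re.IGNORECASE)


def such_as_pairs(text):
    """Return (hyponym, hypernym) pairs matched by the pattern."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text)]


pairs = such_as_pairs("He reads such authors as Shakespeare.")
print(pairs)
```

A single pattern of this kind only fires on sentences that contain the exact
construction, which is why pattern-based methods need very large corpora.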
    Generic pattern classiﬁcation is one of the most signiﬁcant issues in semantic
relation extraction. Generic patterns[13] have broad but noisy coverage.
Difficulties in using these patterns have been a major impediment for supervised
algorithms, resulting in either very low precision or very low recall. Espresso[9]
is a novel unsupervised method that automatically excludes generic patterns and
separates correct and incorrect noun pairs, giving higher reliability to patterns
retrieved from correct pairs. Saeger et al.[6] proposed a weakly supervised
method that classifies generic patterns with the use of semantic word classes
acquired through learning class-dependent patterns.
    Although this previous work has been successful in identifying pairs in
handcrafted or learned patterns from a large number (typically more than a
million) of documents, it is strictly limited to discovery from patterns, i.e.,
discovery is only possible when two nouns exist on a dependency path
corresponding to a pattern. Our method is a novel one that identifies noun pairs
in weak semantic relations by clustering nouns obtained by verb-noun patterns,
instead of directly acquiring noun pairs from patterns. It can obtain a noun pair
that does not co-occur on a dependency path, and thus our method is widely
applicable.

                  液晶テレビでは／初と／なる／「CELL REGZA」を／発売。

             “CELL REGZA” will be launched as the world’s first LCD television.
                                       (a) Sentence


                       Xでは／初と／なる／「CELL REGZA」を／Y。

                      “CELL REGZA” will be Y as the world’s first X.
                                    (b) Acquired pattern

                       Fig. 1. Example of pattern acquisition

3     Semantic Relation Extraction
3.1   Problem Definition
A weak semantic relation is deﬁned as a relationship between two nouns that
appear together with particular verbs. As mentioned in Section 1, a verb often
plays an important role in expressing the meaning of a sentence. If two nouns
frequently appear with particular verbs, then it often implies that they are in a
certain semantic relation. For instance, people often mention, on the Social Web,
their daily experiences and opinions, which are characterized by verb-noun pairs
such as where to go, what to eat, and what to watch. Clustering nouns appearing
with eat could yield a hierarchy of nouns representing food names. The deﬁnition
is derived from this observation.
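    The observation can be made concrete with a small Python sketch that groups
nouns by the verb they appear with. The verb-noun pairs below are hypothetical
examples, not data from the paper, and the helper name nouns_by_verb is an
assumption introduced for illustration only.

```python
from collections import Counter, defaultdict


def nouns_by_verb(pairs):
    """Map each verb to a Counter of the nouns observed with it."""
    index = defaultdict(Counter)
    for verb, noun in pairs:
        index[verb][noun] += 1
    return index


# Hypothetical verb-noun pairs, as if taken from blog sentences.
observed = [("eat", "ramen"), ("eat", "sushi"), ("eat", "ramen"), ("watch", "movie")]
index = nouns_by_verb(observed)
print(sorted(index["eat"]))  # nouns seen with "eat"
```

Nouns gathered under the same verb, here "eat", are the candidates among which
a weak semantic relation is sought.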
    A pattern is deﬁned in this paper as a lexico-syntactic pattern, which is a set
of dependency paths indicating a semantic relation between a verb and a noun.

3.2   Proposed Method
We outline the proposed method. Given a set of documents in a predetermined
domain and a minimum set of verb-noun pairs in a designated semantic relation,
the method retrieves nouns in the relation from the documents and then yields
the hierarchical form of obtained nouns. Note that the given relation is used
for retrieving verb-noun pairs whereas our objective is to obtain nouns in a
hypernym-hyponym relation.
    The method consists of two procedures: bootstrapping and clustering. The
bootstrapping procedure consists of pattern acquisition and term acquisition
steps. The pattern acquisition step explores the documents to find a syntactic
tree on which the given verb-noun pair appears. Replacing the verb and noun
with two variables, it acquires the tree as a relation extraction pattern. The
term acquisition step, in turn, finds sentences that have the same syntactic
structure as the pattern except for the variables. Identifying the terms
corresponding to the variables, it newly acquires verb-noun pairs. The procedure
performs both steps alternately and iteratively. As a result, it can acquire nouns
in an incremental manner. An example of pattern acquisition is illustrated in
Fig. 1.
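    The alternation of the two steps can be sketched in Python as follows. This
is a simplified illustration, not the implementation used in the paper: a
sentence is assumed to be a flat tuple of tokens rather than a syntactic tree,
and the placeholder strings and function names are hypothetical.

```python
NOUN, VERB = "<NOUN>", "<VERB>"  # the two pattern variables


def acquire_patterns(sentences, pairs):
    """Pattern acquisition: replace a known verb and noun with variables."""
    patterns = set()
    for tokens in sentences:
        for verb, noun in pairs:
            if verb != noun and tokens.count(verb) == 1 and tokens.count(noun) == 1:
                patterns.add(tuple(
                    VERB if t == verb else NOUN if t == noun else t
                    for t in tokens))
    return patterns


def acquire_pairs(sentences, patterns):
    """Term acquisition: bind the variables in sentences matching a pattern."""
    pairs = set()
    for tokens in sentences:
        for pattern in patterns:
            if len(tokens) != len(pattern):
                continue
            binding = {}
            for p, t in zip(pattern, tokens):
                if p in (NOUN, VERB):
                    binding[p] = t
                elif p != t:
                    break  # a fixed token differs, so the pattern does not match
            else:
                pairs.add((binding[VERB], binding[NOUN]))
    return pairs


def bootstrap(sentences, seed_pairs, iterations=5):
    """Alternate pattern acquisition and term acquisition."""
    pairs = set(seed_pairs)
    for _ in range(iterations):
        patterns = acquire_patterns(sentences, pairs)
        pairs |= acquire_pairs(sentences, patterns)
    return pairs


sentences = [
    ("we", "ate", "ramen", "for", "lunch"),
    ("we", "ate", "curry", "for", "lunch"),
    ("it", "rained", "all", "day"),
]
result = bootstrap(sentences, {("ate", "ramen")})
print(sorted(result))
```

Starting from the single seed pair, the sketch learns one pattern from the first
sentence and uses it to pick up the pair from the second sentence; the third
sentence matches no pattern and contributes nothing.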
    Following the bootstrapping procedure, the clustering procedure finds the
verbs co-occurring with each noun of a pair. Let V(NP) be the set of verbs that
appear with a noun phrase NP. For two noun phrases NP_1 and NP_2, it determines
whether or not the pair is in a hypernym-hyponym relation by checking the subset
relation between V(NP_1) and V(NP_2) as follows:

 – If V(NP_1) ⊂ V(NP_2) holds, then NP_1 is determined to be a hyponym of NP_2
   (NP_2 is a hypernym of NP_1).
 – If V(NP_2) ⊂ V(NP_1) holds, then NP_2 is determined to be a hyponym of NP_1
   (NP_1 is a hypernym of NP_2).
 – Otherwise, NP_1 is determined to be a synonym of NP_2.
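
    The three rules above can be written as a short Python sketch. It assumes
that ⊂ denotes a proper subset, so that two nouns with identical verb sets fall
into the last rule; the function name and the example verb sets are hypothetical.

```python
def classify_pair(verbs1, verbs2):
    """Relation of NP_1 to NP_2, given the verb sets V(NP_1) and V(NP_2)."""
    v1, v2 = set(verbs1), set(verbs2)
    if v1 < v2:  # V(NP_1) is a proper subset of V(NP_2)
        return "hyponym"
    if v2 < v1:  # V(NP_2) is a proper subset of V(NP_1)
        return "hypernym"
    return "synonym"  # every remaining case, as in the last rule


# Hypothetical verb sets for two nouns.
v_ramen = {"eat", "order"}
v_noodle = {"eat", "order", "cook"}
print(classify_pair(v_ramen, v_noodle))
```

Note that the last rule also covers verb sets that merely overlap or are
disjoint, so the sketch labels such pairs as synonyms exactly as the rules do.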


4     Preliminary Evaluation

4.1    Overview of Evaluation

We apply the proposed method to a set of three hundred Web documents, which
are blog posts mentioning food products. We first label pairs of nouns in a
weak semantic relation in the documents, and then conduct random sampling
of the pairs. Our experiment starts by applying the bootstrapping procedure to
a set of the sampled pairs as seed pairs. After five iterations of the procedure,
the clustering procedure extracts a hypernym-hyponym relation for the food
products, which is composed of a product category and a product name. Precision
and recall of the acquired term pairs are used as evaluation metrics. For
analyzing Japanese sentences, JUMAN^3 and KNP^4 are used as the part-of-speech
tagger and the dependency analyzer, respectively. Note that the dictionaries for
the tagger and the analyzer are not specially compiled for processing the input
documents.

^3 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-e.html
^4 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp-e.html
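
    For reference, precision and recall over term pairs can be computed as in
the following Python sketch. It uses the standard definitions and assumes that
acquired pairs are compared with labeled pairs by exact match; the pairs shown
are hypothetical and are not the evaluation data.

```python
def precision_recall(acquired, gold):
    """Precision and recall of acquired pairs against labeled (gold) pairs."""
    acquired, gold = set(acquired), set(gold)
    correct = len(acquired & gold)
    precision = correct / len(acquired) if acquired else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (category, product name) pairs.
gold = {("noodle", "ramen"), ("noodle", "udon"), ("tea", "sencha"), ("tea", "matcha")}
acquired = {("noodle", "ramen"), ("tea", "ramen")}
precision, recall = precision_recall(acquired, gold)
print(precision, recall)
```
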
                              Table 1. Evaluation results

                #Seed pairs        10      20      30      40       50
                Precision       0.806   0.383   0.231   0.288    0.343
                Recall          0.004   0.006   0.009   0.012    0.018


4.2   Evaluation Results

We summarize the evaluation results in Table 1. Each metric value represents an
average over the evaluation results for three distinct sets of seed pairs. The
given seed pairs have an influence on the precision of the proposed method.
Indeed, as shown in Table 1, precision is 0.806 when the number of seed pairs
is 10, and it is within a range of 0.2 to 0.4 in the other cases. Investigating
the term pairs incorrectly acquired, we found that most of them were extracted
by generic patterns, which are high-frequency ones. Since a generic pattern could
express different kinds of semantic relations, it should be classified into an
adequate pattern class according to its context in a document. On the other
hand, a large number of false negative pairs resulted in low recall, which stays
below 0.02 in all cases. This is primarily because most of the compound nouns
and adjectives were not correctly extracted by the dependency analysis. The
method should therefore incorporate a step that identifies compound nouns
according to the adjacency of terms.


5     Conclusion

The paper has proposed a novel method for semantic relation extraction and
presented a preliminary evaluation. Improvement of the recall is a subject for
future work.


References

 1. F.M. Suchanek, G. Kasneci, G. Weikum, Yago - A large ontology from Wikipedia
    and WordNet, Journal of Web Semantics, vol.6, no.3, pp.203-217, 2008.
 2. F. M. Suchanek, G. Ifrim, G. Weikum, Combining linguistic and statistical analysis
    to extract relations from web documents, Proc. of the 12th ACM International
    Conference on Knowledge Discovery and Data Mining, 2006.
 3. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, DBpedia - a crystallization
    point for the web of data, Journal of Web Semantics, vol.7, no.3, pp.154-165, 2009.
 4. M. Hearst, Automatic acquisition of hyponyms from large text corpora, Proc. of
    the 14th International Conference on Computational Linguistics, 1992.
 5. R. Snow, D. Jurafsky, A.Y. Ng, Learning syntactic patterns for automatic
    hypernym discovery, Advances in Neural Information Processing Systems,
    pp.1297-1304, 2005.
 6. S.D. Saeger, K. Torisawa, J. Kazama, K. Kuroda, M. Murata, Proc. of the IEEE
    International Conference on Data Mining, pp.764-769, 2009.
 7. A. Akbik, J. Broß, Wanderlust: extracting semantic relations from natural language
    text using dependency grammar patterns, Proc. of the Workshop on Semantic
    Search, pp.6-15, 2009.
 8. P. Pantel, D. Ravichandran, Automatically labeling semantic classes, Proc. of
    North American Chapter of the Association for Computational Linguistics, 2004.
 9. P. Pantel, M. Pennacchiotti, Espresso: leveraging generic patterns for automati-
    cally harvesting semantic relations, Proc. of the 21st International Conference on
    Computational Linguistics and the 44th annual meeting of the Association for
    Computational Linguistics, pp.113-120, 2006.
10. R. Feldman, M. Fresko, J. Goldenberg, O. Netzer, L.H. Ungar, Extracting product
    comparisons from discussion boards, Proc. of the IEEE International Conference
    on Data Mining, pp.469-474, 2007.
11. B. Rosenfeld, R. Feldman, Self-supervised relation extraction from the web, Knowl-
    edge and Information Systems, vol.17, no.1, pp.17-33, 2008.
12. M. Banko, O. Etzioni, The tradeoﬀs between open and traditional relation ex-
    traction, Proc. of the 46th Annual Meeting of the Association for Computational
    Linguistics, pp.28-36, 2008.
13. J.R. Curran, T. Murphy, B. Scholz, Minimising semantic drift with mutual ex-
    clusion bootstrapping, Proc. of the 10th Conference of the Paciﬁc Association for
    Computational Linguistics, pp.172-180, 2007.
14. E. Agichtein, L. Gravano, Snowball: extracting relations from large plain-text col-
    lections, Proc. of the 5th ACM conference on Digital Libraries, pp.85-94, 2000.
15. K. Khan, B.B. Baharudin, A. Khan, F. e-Malik, Mining opinion from text docu-
    ments: a survey, Proc. of the 3rd IEEE International Conference on Digital Ecosys-
    tems and Technologies, pp.217-222, 2009.
16. B. Pang, L. Lee, Opinion mining and sentiment analysis, Foundations and Trends
    in Information Retrieval, vol.2, nos.1-2, pp.1-135, 2008.
17. T. Kawamura, S. Nagano, M. Inaba, Y. Mizoguchi, WOM Scouter: mobile ser-
    vice for reputation extraction from weblogs, International Journal of Metadata,
    Semantics and Ontologies, vol.3, no.2, pp.132-141, 2008.