=Paper=
{{Paper
|id=Vol-355/paper-6
|storemode=property
|title=Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation
|pdfUrl=https://ceur-ws.org/Vol-355/turdakov.pdf
|volume=Vol-355
|dblpUrl=https://dblp.org/rec/conf/syrcodis/TurdakovV08
}}
==Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation==
© Denis Turdakov, Pavel Velikhov
ISP RAS
turdakov@ispras.ru, pvelikhov@yahoo.com
Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems (SYRCoDIS), St. Petersburg, Russia, 2008.

Abstract

Wikipedia has grown into a high-quality, up-to-date knowledge base and can enable many knowledge-based applications, which rely on semantic information. One of the most general and quite powerful semantic tools is a measure of semantic relatedness between concepts. Moreover, the ability to efficiently produce a list of ranked similar concepts for a given concept is very important for a wide range of applications. We propose to use a simple measure of similarity between Wikipedia concepts, based on Dice's measure, and provide very efficient heuristic methods to compute top-k ranking results. Furthermore, since our heuristics are based on statistical properties of scale-free networks, we show that these heuristics are applicable to other complex ontologies. Finally, in order to evaluate the measure, we have used it to solve the problem of word sense disambiguation. Our approach to word sense disambiguation is based solely on the similarity measure and produces results with high accuracy.

1 Introduction

Wikipedia is the leading open encyclopedia and has evolved into a comprehensive resource with very good coverage of diverse topics, important entities, events, etc. The English Wikipedia currently contains over 4 million articles (including redirection articles). Furthermore, Wikipedia contains quite a bit of structured information: it has a rich category structure, separate pages for ambiguous terms, and structured data for certain types of articles. Finally, it contains over 90 million links between articles. Most of these links signify that some semantic relationship holds between the source and target concepts, and they can be used to compute a measure of relatedness that typically outperforms traditional text similarity measures.

In our work we have used the links between Wikipedia concepts to compute a semantic relatedness measure (semantic similarity is sometimes used interchangeably with semantic relatedness; however, we consider the latter term to be more specific). While there have been a number of prior works that introduced a variety of similar measures, we present a simple measure that yields good results and at the same time can be computed efficiently. Specifically, we address the problems of computing the top k similar articles for a given Wikipedia article and of computing the similarity measure between two articles. We present simple yet powerful heuristics that yield orders-of-magnitude performance improvements over the straightforward ranking method. The heuristics are also applicable to other complex ontologies, since they are based on statistical properties of complex networks.

In prior work, some authors have evaluated their measures directly against human rankings. Instead, we have validated our approach by solving the problem of word sense disambiguation, for which a number of reference points are available. We believe such a comparison is more reliable and easier to conduct than setting up human experiments.

The rest of the paper is organized as follows. In Section 2 we present our measure of semantic relatedness and discuss its use in a number of applications. In Section 3 we raise the issue of efficient computation of relatedness measures and provide a set of heuristics that significantly improve the performance of ranking with our measure and of computing the relatedness between a pair of concepts. In Section 4 we generalize our results to other complex ontologies. Then, in Section 5 we describe our approach to solving the problem of word sense disambiguation using the relatedness measure and demonstrate our results. Finally, we conclude with future work in Section 6.
2 Semantic Relatedness

Wikipedia has recently been widely recognized as an enabling knowledge base for a variety of intelligent systems. However, Wikipedia is not suitable for machine consumption as is, and in order to bootstrap various tools with Wikipedia knowledge some semantic extraction needs to take place. In our work we extract a simple relatedness measure between Wikipedia articles that proves useful in a variety of tasks such as query refinement in search engines, document classification, document clustering, faceted browsing, etc.

Traditionally, these applications either use basic vector-space similarity functions (in the case of classification and clustering [13]) or rely on pre-built ontologies (for query refinement and faceted browsing [14]). With a high-quality semantic relatedness measure, we can greatly enhance the quality of results in these applications. For example, [12] presents a new classification technique that uses a semantic relatedness measure and achieves excellent results.

2.1 Related Work

We can divide the previous work on similarity measures into two broad classes: basic measures inspired by traditional IR metrics, such as the cosine metric [3,4,11], and graph-theoretic measures, such as SimRank [8,12]. While the second class typically provides a better quality measure, the computational efficiency of these methods is not high enough for practical data-intensive applications. Therefore, in our work we analyze a measure based on Dice's measure, which is commonly used in IR.
2.2 Weighted Dice Metric

Dice's measure has a very intuitive meaning: in the case of Wikipedia articles, two pages will be related if the fraction of the links they have in common, relative to the total number of links of both pages, is high. More formally,

$$\mathrm{Dice}(A,B) = \frac{2\,|n(A) \cap n(B)|}{|n(A)| + |n(B)|}$$

where n(A) is the set of articles linked to article A (considering both incoming and outgoing links), and n(B) is the set of articles linked to B. While exploring the structure of Wikipedia, we have noticed that some types of links are extremely relevant to semantic relatedness, while some other types lead to wrong results. Hence we have added a weighting scheme to the basic measure, based on the following link types:

See Also Links: Most Wikipedia articles have a See Also section that lists related articles. These links explicitly signify that a linked page is semantically related. Therefore, See Also links are very important for semantic relatedness and we assign the highest weight to links of this type. Inverse See Also Links (incoming links) are also quite important and they receive a high weight as well.

Double links: Articles that link to each other directly by regular links in most cases turn out to be quite related, hence this type of link comes next in our weighting scheme.

Links between articles in the same category: Wikipedia has a rich category structure, and articles belonging to the same category are related. However, some categories are very broad and consist of unrelated articles. For example, the category "Indian Actors" contains over 600 articles, and the degree of relatedness between all of these actors is not very significant. Therefore, we identify articles that both have a link between them and share the same category as the next most relevant type of link.

The rest of the links are just regular Wikipedia links, but we separate out two further types, Date links and Template links, which receive the lowest weights in our scheme. In our experiments we have used the weighting scheme shown in Table 1.

  See Also            5      Double Link            2
  Inverse See Also    2      Common Category        1.5
  Regular Link        1      Inverse Regular Link   0.5
  Date                0.1    Template               0.1

Table 1: Weights for various link types

Finally, we have also incorporated IDF weighting into our measure. Since our measure is based on Dice (and not cosine), we do not scale the IDF with a logarithmic function; we simply use it as a multiplier. While the IDF-enhanced measure produces better subjective results, it does not significantly improve our evaluation results on the Word Sense Disambiguation problem.
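As an illustration (not part of the original system), the weighted score could be computed along the following lines. The link-type names, the way link weights enter the Dice formula, and the treatment of the IDF multiplier are our own assumptions; the paper does not specify an implementation.

```python
from collections import defaultdict

# Link-type weights from Table 1 (illustrative key names)
LINK_WEIGHTS = {
    "see_also": 5.0, "inverse_see_also": 2.0,
    "double": 2.0, "common_category": 1.5,
    "regular": 1.0, "inverse_regular": 0.5,
    "date": 0.1, "template": 0.1,
}

def neighborhood(links):
    """links: iterable of (neighbor_id, link_type) pairs for one article.
    Returns {neighbor_id: weight}, keeping the strongest link type per neighbor."""
    weights = defaultdict(float)
    for neighbor, link_type in links:
        weights[neighbor] = max(weights[neighbor], LINK_WEIGHTS.get(link_type, 1.0))
    return weights

def weighted_dice(links_a, links_b, idf=None):
    """Weighted Dice relatedness of two articles; with unit weights and no IDF
    this reduces to 2|n(A) ∩ n(B)| / (|n(A)| + |n(B)|)."""
    na, nb = neighborhood(links_a), neighborhood(links_b)
    common = 0.0
    for neighbor in na.keys() & nb.keys():
        w = na[neighbor] + nb[neighbor]
        if idf is not None:               # IDF used as a plain multiplier, no log scaling
            w *= idf.get(neighbor, 1.0)
        common += w
    total = sum(na.values()) + sum(nb.values())
    return common / total if total else 0.0
```

With all weights equal to 1 and no IDF, the sketch reduces exactly to the unweighted Dice formula above.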
3 Efficient Ranking and Computation

In many applications, such as document classification, faceted browsing and word sense disambiguation, we need to retrieve the list of articles related to a given article, ranked by their relatedness measure. Also, some applications might need to precompute relatedness scores in order to speed up online processing. Typically, only a small percentage of the top-ranking related concepts is needed in these cases. It turns out that at the scale of Wikipedia, the task of ranking the top k articles for a query becomes non-trivial. For example, the article "United Kingdom" has over 80 thousand incoming and outgoing links and thus has a non-zero relatedness score with over a million other Wikipedia articles. A naïve approach to computing the top k results for such articles becomes intractable. Therefore, in order to build practical systems that make use of the relatedness measure, we have to provide heuristics that may sacrifice the quality of the measure slightly, but yield significant improvements in computational efficiency. Below we present four heuristics that dramatically limit the search space of potential related concepts. All of these heuristics are based on the observation that related articles typically link to each other or have a common link with another, highly related article.

3.1 Limiting the Search Space

OL Heuristic (Outgoing Links). Our first heuristic is targeted at articles with large degrees. When computing the top k related pages for these articles, we can limit our search to the outgoing links only, which are a tiny percentage of the incoming links for articles with a high degree. For example, "United Kingdom" has only 900 outgoing links, although its total degree is over 80 thousand. This heuristic works well for very large articles, but produces poor results on articles with intermediate degrees.
OL Top 20 Heuristic. We improve the previous heuristic by making another observation: similar articles that don't have a direct link between them will link to a closely related article. Our second heuristic expands the search for the top k related articles by including all regular outgoing links of the top 20 articles ranked by the OL heuristic (we pick the top 20 articles since this gives a good tradeoff between improved accuracy and slightly higher computational complexity). This still leads to very efficient ranking, since the average number of outgoing links in Wikipedia is about 90 and grows slowly with the total number of links; the number of incoming links, on the other hand, grows fast and reaches very high numbers, as in the case of "United Kingdom".

AL (All Links) and AL Top 20 Heuristics. For articles with a relatively small number of links, the first two heuristics yield poor results. However, since the degrees of such articles are smaller, we can expand the search space without too much computational penalty. The AL heuristic considers all articles with an incoming or an outgoing link. AL Top 20 expands the search space of AL with the regular outgoing links of the 20 top-ranking articles.
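A minimal sketch of how these candidate-limiting heuristics could be combined. The `graph.out_links`/`graph.all_links` accessors, the `score` callable (e.g. the weighted Dice sketch above) and the exact expansion order are our own assumptions, not details given in the paper.

```python
def candidate_set(article, graph, score, use_all_links=False, expand_top=20):
    """Collect candidate articles whose relatedness to `article` is worth scoring.

    graph.out_links(a) / graph.all_links(a) are assumed to return sets of article
    ids; `use_all_links` switches between the OL and AL variants of the heuristic.
    """
    base = graph.all_links(article) if use_all_links else graph.out_links(article)

    # OL / AL step: rank the directly linked candidates by their relatedness.
    ranked = sorted(base, key=lambda c: score(article, c), reverse=True)

    # "Top 20" step: expand with the regular outgoing links of the best candidates.
    expanded = set(base)
    for c in ranked[:expand_top]:
        expanded |= graph.out_links(c)
    return expanded
```

The top k related articles are then obtained by scoring only the articles in this candidate set, instead of every article that shares at least one link with the query article.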
3.2 Efficient Computation of the Relatedness Measure

So far we have focused on limiting the search space of potential high-ranking articles. However, computing the similarity measure itself is also an expensive operation, since Wikipedia articles can have thousands of links. To address this problem, we propose a simple randomized algorithm that computes an approximate similarity measure.

AL Random Sample Heuristic. This heuristic picks only 100 links randomly from a pair of articles and computes the relatedness based on this sample. It performs surprisingly well, producing a ranking very similar to the AL heuristic.
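A sketch of the random-sampling approximation, under our own assumptions about how the sample enters the formula: the paper only states that 100 links are sampled per article, so here we simply apply the weighted Dice computation to the two samples.

```python
import random

def approx_relatedness(links_a, links_b, score, sample_size=100, seed=None):
    """Approximate relatedness from a random sample of links.

    links_a / links_b: lists of (neighbor_id, link_type) pairs;
    score: relatedness function over two such lists (e.g. weighted_dice above)."""
    rng = random.Random(seed)
    sample_a = links_a if len(links_a) <= sample_size else rng.sample(links_a, sample_size)
    sample_b = links_b if len(links_b) <= sample_size else rng.sample(links_b, sample_size)
    return score(sample_a, sample_b)
```

For ranking purposes a consistent bias matters less than the absolute score; sampling only one article's links and testing membership in the other article's full link set would reduce the bias, at the cost of still touching one large link list.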
3.3 Accuracy and Performance

The accuracy and performance of the above heuristics are presented in Figures 1 and 2. We demonstrate the accuracy of our methods by comparing the top 20 ranked results produced by each heuristic method with those of an exhaustive search. To estimate the efficiency, we plot the search space of our heuristics against the search space of the exhaustive method. Due to the computational difficulty of performing the exhaustive search required for these comparisons, we used a sample of articles that have at most 15000 links. However, the trends can be easily extrapolated to larger articles.

Figure 1: Performance of heuristic methods

Figure 2: Accuracy of heuristic methods

4 Other complex ontologies

As we have mentioned above, our results are applicable not only to Wikipedia, but also to other graphs with similar properties. Wikipedia is an instance of a scale-free network [9] that has two important properties: the degree distribution of nodes (articles) follows a power law, and the clustering coefficient (the probability that two articles have a direct link if they are connected through a chain of length two) is higher than in random graphs and also follows a power law. These properties were discovered for Wikipedia in [10,13].
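For readers who want to check this property on their own link graph, here is a rough sketch of estimating the clustering coefficient in the sense used above; the adjacency representation and the wedge-sampling scheme (middle node chosen uniformly, which is a simplification) are our own choices.

```python
import random

def clustering_coefficient(adj, trials=100_000, seed=0):
    """Estimate P(direct link | connected through a chain of length two).

    adj: dict node -> set of neighbor nodes (links treated as undirected here).
    Samples length-two chains u - v - w and checks whether u and w are linked."""
    rng = random.Random(seed)
    nodes = [v for v, nbrs in adj.items() if len(nbrs) >= 2]
    closed = 0
    for _ in range(trials):
        v = rng.choice(nodes)                 # middle node of the chain
        u, w = rng.sample(sorted(adj[v]), 2)  # two distinct neighbors of v
        if w in adj[u]:
            closed += 1
    return closed / trials
```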
We give an informal proof for our heuristics, based on the clustering property of Wikipedia articles. Given an article a, its top k related articles will have a large number of links in common with a, which is especially true for articles with large in- and out-degrees. In that case, the probability that there is a direct link from article a to an article within its top k related articles is very high due to the clustering property. Our measurements for articles of similar sizes yield a clustering coefficient of 7·10^-3. Now, let us consider articles that are related to "United Kingdom": "Labour Party" is ranked 20th with respect to our relatedness measure. "United Kingdom" and "Labour Party" have about 3000 common neighbours, hence the probability that there is a direct link between these articles is extremely close to 1. For articles with such a high degree, the OL heuristic is sufficient to achieve near 100% accuracy. As we consider articles with smaller degrees, the clustering coefficients increase, but the number of common neighbours for top-ranking articles drops very quickly. Therefore, we need to employ heuristics that expand the search space beyond first-degree neighbours.
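To make the "extremely close to 1" claim concrete, here is a back-of-the-envelope estimate of our own (not from the paper), treating each of the roughly 3000 common neighbours as an independent chance to close a triangle with probability equal to the measured clustering coefficient:

$$\Pr[\text{no direct link}] \approx (1 - 7\cdot 10^{-3})^{3000} \approx e^{-21} \approx 7.6\cdot 10^{-10},$$

so under this simplification the probability of at least one direct link is about $1 - 7.6\cdot 10^{-10}$.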
5 Word Sense Disambiguation

Word Sense Disambiguation (WSD) is a complicated but extremely important task in natural language processing. Unresolved ambiguous words can, for example, significantly decrease the precision of text classification: the word "platform" can be used in the expression "railway platform", or it can refer to a hardware architecture or a software platform. Without disambiguating the meaning of ambiguous words, text classification, information retrieval and other algorithms will produce erroneous results.

While a lot of research has been carried out on this topic, the problem is still considered hard. For us, WSD represents a perfect test bed to evaluate our relatedness measure: we use a very simple method, based solely on semantic relatedness, and compare our results with existing approaches.

In the next subsection we take a look at some of the WSD methods that utilize Wikipedia and compare them with our approach.
5.1 Related work

There are two basic approaches to disambiguation: approaches based on machine learning [1,2] and methods based on similarity models [3,4,5,6]. [1] investigates a method for enhancing the performance of supervised learning algorithms. One significant drawback of supervised systems is that they are applicable only to those few words for which sense-tagged data is available, and their accuracy is strongly tied to the amount of labeled data at hand. The approach described in [1] uses Wikipedia as a source of sense annotations for building sense-tagged corpora: the authors suggest using the Wikipedia link structure to generate sense-annotated corpora. We use a similar method to generate the test corpora for evaluating our method.

A similar method was presented in [2]. The authors employed Wikipedia entity pages, redirection pages, categories, and hyperlinks, and built a context-article cosine similarity model and an SVM based on a taxonomy kernel. They evaluated their models for person name disambiguation over 110, 540, and 2,847 categories, reporting accuracies between 55.4% and 84.8% on (55-word context, entity) pairs extracted from Wikipedia, depending on the model and the development/test data employed.

The method that is most similar to our approach is presented in [3]. The authors use Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. ESA works by first building an inverted index from words to all Wikipedia articles that contain them. Then, it estimates a relatedness score for any two documents by using the inverted index to build a vector over Wikipedia articles for each document and by computing the cosine similarity between the two vectors.
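To clarify the contrast with our link-based measure, here is a minimal sketch of the ESA idea as summarized above; the index structure and the simple term-frequency weighting are our own simplifications of [3], not its actual implementation.

```python
from collections import Counter, defaultdict
import math

def build_inverted_index(articles):
    """articles: dict article_id -> list of words.
    Returns word -> {article_id: weight} (simple term-frequency weights here)."""
    index = defaultdict(dict)
    for article_id, words in articles.items():
        for word, count in Counter(words).items():
            index[word][article_id] = count
    return index

def esa_vector(text_words, index):
    """Represent a text as a weighted vector over Wikipedia concepts (articles)."""
    vec = defaultdict(float)
    for word in text_words:
        for article_id, weight in index.get(word, {}).items():
            vec[article_id] += weight
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```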
5.2 Computing candidate word meanings

We make use of Wikipedia's disambiguation pages and redirection articles to obtain candidate meanings of ambiguous words. Then we use our relatedness measure to pick the meaning that has the highest relevance to the context where the ambiguous word appears. Wikipedia contains special types of articles for ambiguous terms. For each ambiguous term, these pages list all of the word's meanings, which are separate articles in Wikipedia with their own link structure. For example, the article "platform (disambiguation)" contains 16 meanings of the word "platform". At the end of 2007 there were more than 80000 disambiguation pages in Wikipedia, and this number is growing.

Wikipedia disambiguation pages are not completely structured, and there are no simple rules that guarantee an error-free extraction of all possible meanings. We illustrate this problem with an example. Usually, different meanings are listed in the first paragraph of the disambiguation page or in lines marked with "*". But such lines often also contain links to pages that are unrelated to the ambiguous term. For example, the article "war (disambiguation)" has the following entry among its meanings:

"War", a song by Joe Satriani off "The Extremist"

If we collected all links from such lines, we would have quite a few erroneous disambiguation candidates. Instead, we pick only the first link from each line of the disambiguation page, and only if the text preceding the link doesn't contain the ambiguous term itself or its acronym. (We skip the link in the example above because the term "war" appears in the text before the link.)

Some ambiguous terms that stem from the case sensitivity of Wikipedia don't have corresponding disambiguation pages, but we can still infer the different meanings of such terms. We convert all Wikipedia terms to upper case and create disambiguation pages when we have conflicts. For example, "SUX" and "Sux" point to "Sioux Gateway Airport" and "Suxamethonium chloride" respectively, and we create the appropriate disambiguation page for them.
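A sketch of this first-link-per-line extraction rule; the regular expression for wiki links and the acronym check are our own simplifications, and real disambiguation pages need more robust parsing.

```python
import re

WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def candidate_meanings(disambig_wikitext, term):
    """Extract candidate meanings for `term` from disambiguation-page wikitext.

    Takes the first link of each '*' line, skipping lines where the text before
    that link already mentions the term (or an acronym-like upper-case form)."""
    acronym = "".join(w[0] for w in term.split()).upper()
    candidates = []
    for line in disambig_wikitext.splitlines():
        if not line.lstrip().startswith("*"):
            continue
        match = WIKI_LINK.search(line)
        if not match:
            continue
        preceding = line[:match.start()].lower()
        if term.lower() in preceding or (len(acronym) > 1 and acronym.lower() in preceding):
            continue        # e.g. '"War", a song by Joe Satriani ...' is skipped
        candidates.append(match.group(1).strip())
    return candidates
```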
For our experiments we select the 7 closest terms from the context in which the ambiguous term appears, and then compute the semantic distance by two methods: a naïve method, described in the next subsection, that only uses the Wikipedia graph, and the semantic relatedness measure that we introduced above.

5.3 WSD Method based on the Wikipedia graph (Naïve)

For the naïve method, we define the semantic distance between the context and a candidate term meaning as the number of common articles in the neighborhoods of the context and the term in Wikipedia (by the neighborhood of an article we mean all Wikipedia articles that have an incoming or an outgoing link to that article).

Figure 3: Naive WSD Method

For example, consider the disambiguation process for the term "platform" in the text "Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and …". There are four Wikipedia concepts around this word: "open-source", "web server", "HTTP" and "implementation". For each of the 16 meanings of the word "platform", the system calculates the number of common articles between the neighborhood of the meaning and the union of the neighborhoods of the topics from the context. The largest intersection is with the neighborhood of the "platform (computing)" article, with 122 common articles. The next candidate is "platform game", with only 18 common articles. Therefore, we decide that in this context the meaning of the word "platform" is a computer platform.

This simple method shows good precision (62.2%, see Table 2); however, it has some drawbacks. First, it uses only nearest neighbors to compute semantic distance, but sometimes semantically related articles do not have a direct link between each other. If we took all second neighbors of an article into account, however, the WSD task would become too computationally complex. The next problem is that links between two articles can be semantically poor. For example, a link between "Moscow" and "Fahrenheit" makes little sense. This can significantly decrease the precision of the method. Hence, a natural way to increase the precision of this technique is to use the relatedness metric in order to compute the similarity between a candidate meaning and its context.
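A minimal sketch of the naïve method under the neighborhood definition above; `neighbors()` is an assumed accessor over the Wikipedia link graph, and candidate/context extraction is as described earlier.

```python
def naive_disambiguate(candidates, context_terms, neighbors):
    """Pick the candidate meaning whose neighborhood overlaps most with the context.

    candidates: candidate article ids for the ambiguous term.
    context_terms: article ids of the (up to 7) surrounding Wikipedia concepts.
    neighbors: function article_id -> set of articles linked to/from it."""
    context_neighborhood = set()
    for term in context_terms:
        context_neighborhood |= neighbors(term)

    def overlap(candidate):
        return len(neighbors(candidate) & context_neighborhood)

    return max(candidates, key=overlap)
```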
5.4 WSD based on Semantic Relatedness

We improve over the naïve method by using the measure of semantic relatedness. Given a meaning candidate c and a set of context terms t1 … tn, we compute a ranked list of related articles R(c) for c and R(ti) for each ti in t1 … tn. We use the AL heuristic, described in the previous section, in order to obtain these lists efficiently. The distance between c and t1 … tn is computed as follows. We first take the union of the R(ti) lists, summing up the weights of repeated articles, and obtain R(T). Then we compute the similarity between R(c) and R(T) using the following methods: sum of products, Cosine, Dice and Jaccard measures. We apply these four measures in the naïve approach as well, where the weights are binary. Finally, we pick the candidate that maximizes the given similarity function.
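A sketch of this procedure; `related_articles()` is assumed to return the weighted list produced by the AL heuristic, and only the cosine variant of the final comparison is shown.

```python
from collections import defaultdict
import math

def disambiguate(candidates, context_terms, related_articles):
    """related_articles: function article_id -> dict {related_article: weight}
    (e.g. the ranked list produced by the AL heuristic)."""
    # R(T): union of the context lists, summing weights of repeated articles.
    r_t = defaultdict(float)
    for term in context_terms:
        for article, weight in related_articles(term).items():
            r_t[article] += weight

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Pick the candidate whose related-article list is most similar to R(T).
    return max(candidates, key=lambda c: cosine(related_articles(c), r_t))
```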
5.5 Experiments and Results

One of the problems in evaluating WSD techniques is that there are no conventional benchmarks. Also, the WSD problem has many variations and must be carefully scoped. For instance, most NLP tools solve the problem of part-of-speech tagging, which is a special case of WSD, whereas we only focus on disambiguating proper nouns. Hence, we created our own benchmarks from the Wikipedia content itself. Wikipedia links often have the following structure:

[[ part1 | part2 ]]

where part1 is a normalized Wikipedia topic and part2 is the text in the link anchor that is presented to the user. If part2 is an ambiguous term, then part1 is usually the Wikipedia topic corresponding to one of the term's meanings. We use the dictionary created from disambiguation pages to parse Wikipedia articles and find links with ambiguous terms in part2 which at the same time have one of the disambiguation candidates in part1. For example, there is a link [[ platformism | platform ]] in the article "Anarchism"; here it means that the ambiguous term "platform" has the meaning "platformism".

Therefore, we don't rely on NLP methods to search for ambiguous words, and we can easily get the correct answer for the disambiguation using part1. We used this technique to construct a small test corpus with 1000 ambiguous terms in order to evaluate our methods.
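A sketch of how such a test corpus could be harvested from article wikitext; the link regular expression and the dictionary format are illustrative assumptions.

```python
import re

PIPED_LINK = re.compile(r"\[\[\s*([^\]|]+?)\s*\|\s*([^\]|]+?)\s*\]\]")

def harvest_test_cases(article_wikitext, meanings_by_term):
    """Collect (ambiguous term, gold meaning) pairs from one article.

    meanings_by_term: dict ambiguous term -> set of candidate meanings,
    built from the disambiguation pages as described in Section 5.2."""
    cases = []
    for target, anchor in PIPED_LINK.findall(article_wikitext):
        term = anchor.lower()
        if term in meanings_by_term and target in meanings_by_term[term]:
            cases.append((term, target))   # e.g. ("platform", "platformism")
    return cases
```

Each harvested pair gives an ambiguous surface form together with the meaning chosen by the Wikipedia editor, which serves as the gold answer for evaluation.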
The best result (72.58%) is obtained by using the Dice and Jaccard measures with the semantic relatedness method. This result shows that we achieve much better precision when using the semantic relatedness measure than with the naïve approach (best result 61.83%). The results are summarized in Table 2.

                 Naive    Semantic Relatedness
  Intersection   61.83    71.80
  Cosine         59.82    72.52
  Dice           61.42    72.58
  Jaccard        60.28    72.58

Table 2: Comparison of WSD methods (precision, %)

We also ranked the potential word meanings by similarity weight in descending order and inspected the top 2 and top 3 results. The average number of meanings for an ambiguous word was 17.65 for our test set. Almost 87% of the right answers were in the top two positions of the sorted lists, and the top 3 positions contained more than 91% of the right answers. These results are summarized in Table 3.

                 Top-2    Top-3
  Intersection   85.12    89.03
  Cosine         87.46    90.86
  Dice           86.95    91.38
  Jaccard        86.95    91.38

Table 3: Percentage of correct meanings among the top-ranked results

The difference between the similarity scores of the correct answer and the first-ranked answer was very small. Hence, there is potential to improve our results further by improving our context modeling methods and incorporating linguistic analysis. Currently, some verbs that are homonyms of nouns are frequently detected as nouns. For example, the verb "aims" was detected as the acronym "AIMS". Furthermore, due to the limitations of our text parser, we detect Wikipedia terms only in their normal form. We believe that by improving our method to avoid these problems we will achieve performance close to that of human experts.
6 Conclusions and Future Work

We have presented a simple measure of semantic relatedness based on the link structure of Wikipedia. We addressed the problem of computing this measure efficiently and have provided heuristics for computing the top k related articles. These heuristics achieve high accuracy while limiting the search space drastically, and they make the approach suitable for practical use in a variety of data-intensive systems. We also presented a randomized algorithm to compute the relatedness measure between two articles efficiently and have shown that its ranking accuracy is very close to that of the true measure. In order to evaluate the quality of the measure, we presented a simple method for word sense disambiguation based on the relatedness measure. We evaluated our approach and found it to perform on par with competing approaches and close to the performance of human experts.

In future work, we will explore more heuristics with the aim of producing a single tunable method with an explicit analysis of computational complexity and graceful degradation. Furthermore, we plan to formally prove the quality of the heuristics using the statistical properties of scale-free networks. This will enable us to estimate the quality of a specific tunable heuristic in advance, without the need to experiment. Finally, we are planning to explore more sophisticated methods of modelling context, similar to the method presented in [12], and to compute a term-context similarity measure. We also plan to incorporate linguistic analysis into our text parser, and we expect this to improve WSD results significantly.

References

[1] Rada Mihalcea. Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of NAACL HLT 2007, pages 196–203, Rochester, NY, April 2007.

[2] Razvan Bunescu and Marius Pasca. Using Encyclopedic Knowledge for Named Entity Disambiguation.

[3] E. Gabrilovich and S. Markovitch. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis. In Proceedings of IJCAI, pages 1606–1611, 2007.

[4] M. Strube and S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of AAAI, pages 1419–1424, 2006.

[5] Silviu Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on EMNLP and CoNLL, pages 708–716, Prague, Czech Republic, 2007.

[6] Simon Overell, Joao Magalhaes and Stefan Ruger. Place Disambiguation with Co-occurrence Models.

[7] R. Mihalcea, T. Chklovski, and A. Kilgarriff. The Senseval-3 English Lexical Sample Task. In Proceedings of SENSEVAL-3, pages 25–28.

[8] Glen Jeh and Jennifer Widom. SimRank: A Measure of Structural-Context Similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23–26, 2002.

[9] R. Albert and A.-L. Barabási. Statistical Mechanics of Complex Networks. Rev. Mod. Phys. 74, 47–97, 2002.

[10] D. Turdakov. Recommender System Based on User-Generated Content. In Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems (SYRCoDIS), Moscow, Russia, 2007.

[11] David Milne. Computing Semantic Relatedness Using Wikipedia Link Structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC), 2007.

[12] Y. Ollivier and P. Senellart. Finding Related Pages Using Green Measures: An Illustration with Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI'07), Vancouver, Canada, 22–26 July 2007.

[13] J. Voß. Measuring Wikipedia. In Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics, Stockholm, Sweden, 2005.

[14] Maciej Janik and Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text Categorization". UGA Technical Report No. UGA-CS-TR-07-001, November 2007.