=Paper=
{{Paper
|id=None
|storemode=property
|title=Mining Diverse Views from Related Articles
|pdfUrl=https://ceur-ws.org/Vol-762/paper6.pdf
|volume=Vol-762
}}
==Mining Diverse Views from Related Articles==

Ravali Pochampally (Center for Data Engineering, IIIT Hyderabad, Hyderabad, India; ravali@research.iiit.ac.in) and Kamalakar Karlapalem (Center for Data Engineering, IIIT Hyderabad, Hyderabad, India; kamal@iiit.ac.in)
ABSTRACT

The world wide web allows for diverse articles to be available on a news event, product or any topic. It is not unusual to find a few hundred articles discussing a specific topic, which makes it difficult for a user to quickly process the information. Summarization condenses the huge volume of information related to a topic but does not provide a delineation of the issues pertaining to it. We want to extract the diverse issues pertaining to a topic by mining views from a collection of articles related to it. A view is a set of sentences, related in content, that address an issue relevant to a topic. We present a framework for the extraction and ranking of views and have conducted experiments to evaluate the framework.

Categories and Subject Descriptors: H.5 [Information Systems]: Information Interfaces and Presentation

General Terms: Human Factors, Experimentation

Keywords: text mining, views, diversity, information retrieval

1. INTRODUCTION

The world wide web is a storehouse of information. Users who want to comprehend the content of a particular topic (e.g. FIFA 2010) are often overwhelmed by the volume of text available on the web. Websites which organize information based on content (google news, http://news.google.com/) and/or user ratings (amazon, http://www.amazon.com/; imdb, http://www.imdb.com/) also output several pages of text in response to a query. It is difficult for an end-user to process all the text presented.

Multi-Document Summarization [2] is a prominent Information Retrieval (IR) technique for dealing with this problem of information overload. But summaries typically lack the semantic grouping needed to present the multiple views addressed by a group of articles. Providing diverse views and allowing users to browse through them will facilitate the goal of information exploration by giving the user a definite and detailed snapshot of their topic of interest. Articles which pertain to a common topic (e.g. swine flu in India) are termed 'related'. By isolating views, we aim to organize content in a more detailed manner than summarization does. We define a view as:

A sentence or a set of sentences which broadly relate to an issue addressed by a collection of related articles and aid in elaborating the different aspects of that issue.

1.1 Motivating Example

Here is a pair of views obtained by our framework. Both views are mined from Dataset 1. The number in curly brackets indicates the ID of the article from which each sentence is extracted. A description of the datasets is given in Table 1.

Example Views

1. The irresponsibility of the financial elite and US administrations has led the US economy to the brink of collapse. {18} On Friday, the Dow was down a mere 0.3% on the week - but to get there, the Fed and the Treasury had to pump hundreds of billions into the global financial system. {14} The collapse of the oldest investment bank in the country could strongly undermine the whole US financial system and increase the credit crisis. {3} After a week that saw the collapse of Lehman Brothers, the bailout of the insurer AIG and the fire sale of Merrill Lynch and the British bank HBOS, policy makers hit back, orchestrating a huge plan to sustain credit markets and banning short sales of stock. {48} It was a dramatic reversal from the first half of the week, when credit markets virtually seized up and stocks around the globe plunged amid mounting fears for the health of the financial system. {18}

2. The Swiss National Bank is to pump USD 27 billion into markets and the Bank of Japan (BOJ) valued its part in the currency swap with the Federal Reserve at 60 billion. {35} The Bank of Canada was also involved, and The Bank of England said it would flood 40 billion into the markets. {26} And, despite the agreements that Barclays Capital and Bank of America will sign with executives at Lehman Brothers or Merrill Lynch, it is the hunting season in the banking world for the crème de la crème. {14}
The first view details the breakdown of the US economy along with a few signs of damage control. The second view reports the actions of various banks during the financial turmoil in 2008. These views capture a glimpse of the specific issues pertaining to the topic of 'financial meltdown'. A list of such diverse views would organize the content of a collection of related articles and provide a perspective into that collection.

The problem statement is:

Given a corpus of related articles A, identify the set V of views pertaining to A, rank V and detect the most relevant view (MRV) along with the set of outlier views (OV).

ID  Source           Search Term          # Articles
1   google news      financial meltdown   49
2   google news      swine flu india      100
3   google news      israel attacks gaza  24
4   amazon.com       the lost symbol      25
5   tripadvisor.com  hotel taj krishna    20
6   tripadvisor.com  hotel marriott       16
7   google news      fifa vuvuzela        39
8   google news      gulf oil spill       26

Table 1: Datasets

1.2 Related Work

Powell and French [1] [8] proposed that providing multiple view-points of a document collection and allowing users to move among these view-points facilitates the location of useful documents. Representations, processes and frameworks required for developing multiple view-points were put forth.

Tombros et al. [10] proposed the clustering of Top-Ranking Sentences (TRS) for efficient information access. Clustering and summarization were combined in a novel way to generate a personalized information space. Clusters of TRS were generated by a hierarchical clustering algorithm using the group-average-link method. It was argued that TRS clustering provides better information access than routine document clustering.

TextTiling [5] is a technique for subdividing text into multi-paragraph units that represent passages or subtopics. It makes use of patterns of lexical co-occurrence and distribution. The algorithm has three parts: tokenization into sentence-sized units, determination of a score for each unit, and detection of sub-topic boundaries. Sub-topic boundaries are assumed to occur at the largest valleys in the graph that results from plotting sentence-units against scores.

1.2.1 Views vs. Summary

A summary and the views generated for Dataset 5 are available at https://sites.google.com/site/diverseviews/comparison. The summary is generated by the update summarization 'baseline algorithm' [6]. It is conspicuous by its lack of organization. Though successful in covering the salient features of the review dataset, it groups several conflicting sentences together (observe the last two sentences of the summary). The views generated by our framework present an organized representation by generating clusters of semantically related sentences. As is evident, the first view discusses the positive attributes of hotel taj krishna in hyderabad, while the second view is negative in tone. The third and fourth views discuss specific aspects of the hotel such as the food and the facilities available. Presenting multiple views for a topic allows us to model the diversity in its content. Our representation is concise, as the average number of sentences per view was found to be 3.9. In our framework, we address two drawbacks of summarization: lack of organization and verbosity (due to user-specified parameters).

1.3 Contributions

The main contributions of this work are:

1. Defining the concept of a view over a corpus of related articles
2. Presenting a framework for mining diverse views
3. Ranking the views based on a quality parameter (cohesion) defined by us, and
4. Presenting results to validate the framework

1.4 Organization

In section 2, we elaborate on the framework for the extraction of views. MRV, OV and the ranking mechanism are explained in detail in section 2.5. Section 3 presents the experimental evaluation and discussion. In section 4, we sum up our contributions and outline future work.

2. EXTRACTION OF VIEWS

In this section, we detail the steps involved in the extraction of views and define a quality parameter for ranking the views according to their relevance. Figure 1 presents an overview of the framework by depicting the steps involved in the algorithm; input and output are specified for each step.

[Figure 1 shows the pipeline: a set of related articles (A) → data cleaning and preprocessing (HTML + text to raw text) → extraction of top-ranking sentences (top n sentences) → clustering engine (views) → ranking by the quality parameter (cohesion) → ranked views with MRV and OV.]

Figure 1: Framework
2.1 Datasets

Articles which make relevant points about a common topic but score low on pairwise cosine similarity can be included in our datasets, because we aim to present multiple views from a set of related articles rather than group them based on overall content similarity. We used data from news aggregator and review web sites as they group articles discussing a common topic, in spite of the low semantic similarity between them. We crawled articles published between a range of dates when the activity pertaining to a relevant topic peaked. For example, we crawled articles published on 'gulf oil spill' between 15 April 2010 and 15 July 2010, when the news activity pertaining to that topic was at its maximum. We crawled websites which provided rss feeds or had a static html format that could be parsed. Table 1 provides a description of the datasets. For instance, Dataset 1 is collected from google news using the search term 'financial meltdown' and contains 49 articles. The datasets can be found here: https://sites.google.com/site/diverseviews/datasets

2.2 Data Cleaning and Preprocessing

Web data was collected using Jobo (http://java-source.net/open-source/crawlers/jobo), a java crawler. The data was given as an input to the data cleaning and preprocessing stage. Data cleaning is important as it parses the html data and removes duplicates from the articles. We define a 'duplicate' as an article having the exact syntactic terms and sequences, with or without formatting differences. Hence, by our definition, duplicates have a cosine similarity value of one.

Text data devoid of html tags is given as an input to the data preprocessing stage. Stemming and stopword removal are performed in the preprocessing stage. Stemming is the process of reducing inflected (or derived) words to their stem or root form (for example, running to run, parks to park). In most cases, these morphological variants of words have similar semantic interpretations and can be considered equivalent for the purpose of IR applications. Stopwords are the highly frequent words in the english language (for example, a, an, the). Owing to their high frequency, and their usage as conjunctions and prepositions, they do not add any significant meaning to the content. Hence, their removal is essential to remove superfluous content and retain the essence of the article. In order to capture the user notion, the review datasets were not checked for typographical and grammatical errors and were retained verbatim. The Python modules HTMLParser (http://docs.python.org/library/htmlparser.html) and nltk.wordnet (http://www.opendocs.net/nltk/0.9.5/api/nltk.wordnet-module.html) were used to parse the html data and perform stemming respectively. IR metrics such as word frequency and TF-IDF (http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html) were extracted for future analysis.
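As a concrete illustration, here is a minimal Python sketch of this cleaning and preprocessing stage, assuming the crawled pages arrive as raw HTML strings; the helper names (strip_html, preprocess, remove_duplicates) are ours, not part of the paper's implementation, and lemmatization stands in for the nltk.wordnet-based stemming mentioned above:

    import string
    from html.parser import HTMLParser
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    class TextExtractor(HTMLParser):
        """Collects the text content of an HTML page, discarding the tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def strip_html(raw_html):
        parser = TextExtractor()
        parser.feed(raw_html)
        return " ".join(parser.chunks)

    STOP = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text):
        """Lowercase, remove stopwords and reduce words to a root form."""
        tokens = [t.strip(string.punctuation) for t in text.lower().split()]
        return [LEMMATIZER.lemmatize(t) for t in tokens if t and t not in STOP]

    def remove_duplicates(articles):
        """Drop articles whose token sequences match exactly; such pairs
        have a cosine similarity of one, the paper's duplicate criterion."""
        seen, unique = set(), []
        for text in articles:
            key = tuple(preprocess(text))
            if key not in seen:
                seen.add(key)
                unique.append(text)
        return unique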
2.3 Extraction of Top-Ranking Sentences

A dataset consisting of many articles and having content spanning various issues needs a pruning mechanism to extract the sentences from which the views can be generated. We prune a dataset by scoring each sentence in it and extracting the top-ranked ones. A list of the notations used in our discussion is given in Table 2.

$T_{i,j}$ : $\mathit{tfidf}_{i,j}$
$\mathit{tfidf}_{i,j}$ : TF-IDF of term $t_i$ in article $d_j$
$\mathit{tfidf}_{i,j} = tf_{i,j} \cdot idf_i$
$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
$n_{i,j}$ : number of occurrences of $t_i$ in article $d_j$
$\sum_k n_{k,j}$ : number of occurrences of all terms $t_k$ in article $d_j$
$idf_i = \log \frac{|D|}{|d : t_i \in d|}$
$|D|$ : total number of articles in the corpus
$|d : t_i \in d|$ : number of articles which contain the term $t_i$

Table 2: Notations

Let $\langle S_1, S_2, S_3, \ldots, S_n \rangle$ be the set of sentences in an article collection. The $\mathit{tfidf}_{i,j}$ (TF-IDF) of a term $t_i$ in article $d_j$ is obtained by multiplying its weighted term frequency $tf_{i,j}$ and inverse document frequency $idf_i$. A high value of $\mathit{tfidf}_{i,j}$ ($T_{i,j}$) is attained by a term $t_i$ which has a high frequency in a given article $d_j$ and a low occurrence rate across the spectrum of articles present in that collection. The appearance of some words in an article is more indicative of the issues addressed by it than others. $T_{i,j}$ is a re-weighting of word importance: though it increases proportionally with the number of times a word appears in an article, it is offset by the frequency of the word in the corpus. We consider the product of the $T_{i,j}$ of the constituent words in a sentence to be a good indicator of its significance. A product can be biased by the number of words in a sentence; hence, we normalize the product by dividing it by the length of the sentence. Given the notation above, we thus define the importance $I_k$ of a sentence $S_k$, belonging to an article $d_j$ and having $r$ constituent words, as

$$I_k = \frac{\prod_{i=1}^{r} T_{i,j}}{r}$$

where $T_{i,j}$ is the tf-idf of word $w_i \in S_k \wedge d_j$, and $I_k$ is the product of the tf-idf of all $w_i$, normalized according to the sentence ($S_k$) length $r$.

Logarithm normalization was not used, as the σ value for r was 2.2 and the variance in its value was not exponential. Sentences are arranged in the non-increasing order of their importance (I) scores. We choose the top n sentences for our analysis. Experiments are conducted to correlate the range of n with the corresponding score obtained by our ranking parameter.
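The scoring step can be read directly off the notation above. A sketch, assuming each article is already preprocessed into a list of tokens (tfidf_table and importance are illustrative names, not from the paper's code):

    import math
    from collections import Counter

    def tfidf_table(articles):
        """articles: list of token lists. Returns one {term: T_i,j} map
        per article, following the definitions in Table 2."""
        doc_freq = Counter()
        for tokens in articles:
            doc_freq.update(set(tokens))
        n_docs = len(articles)
        tables = []
        for tokens in articles:
            counts = Counter(tokens)
            total = len(tokens)  # sum_k n_k,j
            tables.append({term: (c / total) * math.log(n_docs / doc_freq[term])
                           for term, c in counts.items()})
        return tables

    def importance(sentence_tokens, tfidf_j):
        """I_k: product of the tf-idf of the r constituent words of a
        sentence, divided by r (the length normalization)."""
        r = len(sentence_tokens)
        if r == 0:
            return 0.0
        product = 1.0
        for w in sentence_tokens:
            product *= tfidf_j.get(w, 0.0)
        return product / r

Scoring every sentence against its own article's table and sorting in non-increasing order of importance then yields the top-n pool.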
2.4 Mining Diverse Views

A measure of similarity between two sentences is required to extract semantically related views from them. Semantic similarity calculates the correlation score between sentences based on the likeness of their meaning. Mihalcea et al. [7] proposed that the specificity of a word can be determined using its inverse document frequency (idf). Using a metric for word-to-word similarity and specificity, the semantic similarity of two text sentences $S_i$ and $S_j$, where $w$ represents a word in a sentence, is defined by them as

$$sim(S_i, S_j) = \frac{1}{2}\left(\frac{\sum_{w \in \{S_i\}} maxSim(w, S_j) \cdot idf(w)}{\sum_{w \in \{S_i\}} idf(w)} + \frac{\sum_{w \in \{S_j\}} maxSim(w, S_i) \cdot idf(w)}{\sum_{w \in \{S_j\}} idf(w)}\right)$$

This metric is used for our analysis as it combines the semantic similarities of each text segment with respect to the other. For each word $w$ in the segment $S_i$, we identify the word in segment $S_j$ that has the highest semantic similarity, i.e. $maxSim(w, S_j)$, according to some pre-defined word-to-word similarity measure. Next, the same process is applied to determine the most similar word in $S_j$ with respect to the words in $S_i$. The word similarities are then weighted with the corresponding word specificities, summed up and normalized according to the length of each sentence.
Wordnet based similarity measures score well in recognizing semantic relatedness [7]. Pucher [9] has carried out a performance evaluation of the wordnet based semantic similarity measures and found that wup [4] is one of the top performers in capturing semantic relatedness. We also chose wup because it is based on the path length between the synsets of words and its performance is consistent across various parts-of-speech (POS). We used Python nltk.corpus (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-module.html) to implement wup. Pairwise semantic similarity $sim(S_i, S_j)$, or $s_{i,j}$, is a symmetric relation:

$$\forall s_{i,j} \in X \implies s_{i,j} = s_{j,i}$$

Thus, we used the upper triangle of the similarity matrix (X) to reduce computational overhead.
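A sketch of this similarity computation with wup as the word-to-word measure; taking the best wup score over all synset pairs of two words is our assumption for lifting a synset-level measure to words, not a detail fixed by the paper:

    from nltk.corpus import wordnet as wn

    def wup(word1, word2):
        """Word-to-word Wu-Palmer similarity: best score over synset pairs."""
        scores = [s1.wup_similarity(s2) or 0.0
                  for s1 in wn.synsets(word1)
                  for s2 in wn.synsets(word2)]
        return max(scores, default=0.0)

    def max_sim(word, sentence):
        """maxSim(w, S): similarity of `word` to its closest word in S."""
        return max((wup(word, other) for other in sentence), default=0.0)

    def sim(s_i, s_j, idf):
        """Mihalcea et al. [7] similarity of two token lists; idf maps
        each term to its inverse document frequency."""
        def directed(a, b):
            num = sum(max_sim(w, b) * idf.get(w, 0.0) for w in a)
            den = sum(idf.get(w, 0.0) for w in a)
            return num / den if den else 0.0
        return 0.5 * (directed(s_i, s_j) + directed(s_j, s_i))

Since sim is symmetric, only the upper triangle of X need be computed, as noted above.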
We used clustering to proceed from a set of sentences to views containing similar content. The similarity matrix (X) was given as an input to Python scipy-cluster (http://code.google.com/p/scipy-cluster), which uses Hierarchical Agglomerative Clustering (HAC). HAC was used because we can terminate the clustering when the values of the scoring parameter converge, without explicitly specifying the number of clusters to output.

Each cluster comprises sentences grouped according to the similarity measure ($s_{i,j}$) discussed above. Hence, it is logical to treat the clusters as views discussing a specific issue. In the next section, we propose a quality parameter for the ranking and evaluation of views.
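A sketch of the clustering step using scipy.cluster.hierarchy, where the hierarchical routines of the scipy-cluster package now live; the fixed distance threshold below is our stand-in for the convergence-based stopping rule described above:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_sentences(sim_matrix, threshold=0.5):
        """Group sentence indices into views by HAC over 1 - similarity.
        sim_matrix: symmetric pairwise similarities in [0, 1]."""
        distance = 1.0 - np.asarray(sim_matrix, dtype=float)
        np.fill_diagonal(distance, 0.0)
        condensed = squareform(distance, checks=False)
        tree = linkage(condensed, method="average")  # group-average link
        labels = fcluster(tree, t=threshold, criterion="distance")
        views = {}
        for index, label in enumerate(labels):
            views.setdefault(label, []).append(index)
        return list(views.values())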
2.5 Ranking of Views

Our qualitative parameter for ranking the views focuses on the average pairwise similarity between the constituent sentences of a view V in order to define its cohesion (C). We define cohesion as

$$C = \frac{\sum_{i,j \in V} s_{i,j}}{len(V)}$$

where $s_{i,j} = sim(T_i, T_j)$, $V$ is the set of sentences ($T_i$) comprising the view, and $len(V)$ is the number of sentences in the view.

As per our definition, the higher the value of cohesion, the greater the content similarity between the sentences of a view. Our framework is designed to ascribe importance to views with maximum pairwise semantic similarity. Thus, we defined the Most Relevant View (MRV) as the view with the maximum value of cohesion, i.e., maximum content overlap amongst its constituent sentences. Outlier views (OV) represent the set of views containing a single sentence. They are termed outliers because their semantic similarity with the other sentences is too low to allow any meaningful grouping. We rank all the views in the non-increasing order of their cohesion. As their corresponding pairwise similarity is zero, outlier views have a cohesion value of zero. Hence, we order the outlier views according to their importance (I) scores.
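A sketch of this ranking step, with each view given as a list of sentence indices into the similarity matrix; rank_views is an illustrative name:

    def cohesion(view, sim_matrix):
        """C: pairwise similarities summed over the sentences of a view
        (each unordered pair counted once), divided by len(V)."""
        total = sum(sim_matrix[i][j] for i in view for j in view if i < j)
        return total / len(view)

    def rank_views(views, sim_matrix, importance_scores):
        """Sort views by non-increasing cohesion; the MRV is the first.
        Singleton outlier views (cohesion zero) are ordered separately
        by the importance (I) score of their lone sentence."""
        ranked = sorted(views, key=lambda v: cohesion(v, sim_matrix),
                        reverse=True)
        mrv = ranked[0]
        outliers = sorted((v for v in ranked if len(v) == 1),
                          key=lambda v: importance_scores[v[0]], reverse=True)
        return ranked, mrv, outliers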
2.6 Framework for Extracting Views

Algorithm 1 provides the steps involved in mining diverse views from a set of related articles. The articles are cleaned by parsing the html and removing duplicates. IR metrics such as TF-IDF are collected before calculating the importance (I) of each sentence. The sentences are ranked in the non-increasing order of their importance to pick the top n sentences. We calculate the pair-wise semantic similarity between the chosen sentences to cluster them. Clustering is used to generate semantically related views from a set of disparate sentences. We rank the views according to the quality parameter proposed by us.

Algorithm 1 Mining Diverse Views
Require: Related Articles A
Ensure: Ranked Views V with MRV and OV
1: for all a in A do
2:   aClean ← ParseHTML(a)
3:   if aClean is not duplicate then
4:     ACLEAN ← ACLEAN + aClean
5:   else
6:     discard aClean
7:   end if
8: end for
9: for all a in ACLEAN do
10:   a ← removeStopwords(a) // ranks.NL stopwords
11:   ASTEM ← ASTEM + stem(a) // nltk stemmer
12: end for
13: for all a in ASTEM do
14:   for all word in a do
15:     computeTFIDF(word)
16:   end for
17: end for
18: for all sentence in ASTEM do
19:   rankedSentences ← calculateImportance(sentence) // section 2.3
20: end for
21: topN ← pickTopSentences(rankedSentences, n) // as per importance (I)
22: for all sentence s1 in topN do
23:   for all sentence s2 in topN do
24:     if (s1, s2) not in simMatrix then
25:       simMatrix ← simMatrix + calculateSimilarity(s1, s2)
26:     end if
27:   end for
28: end for
29: rawViews ← clusteringEngine(simMatrix) // scipy-cluster
30: for all view in rawViews do
31:   views ← views + calculateCohesion(view) // section 2.5
32: end for
33: rankedViews ← rankByCohesion(views)
34: MRV ← chooseMaxCohesion(rankedViews)
35: OV ← chooseZeroCohesion(rankedViews)

3. EXPERIMENTAL EVALUATION

Extraction of Top-Ranking Sentences requires the number of constituent sentences (n) as an input. The ideal range of values for this input parameter is the one which maximizes the cohesion of views, and determining it is a critical part of our framework. Hence, we analysed the result data to find the relevant range for n.

An input parameter producing views where the median cohesion is greater than (or equal to) the mean is preferred. As the mean is influenced by the outliers in a dataset, the median being at least as high as the mean indicates consistency across the values of cohesion. If all the values of mean cohesion are greater than those of the median, the input parameter yielding views with the maximum mean cohesion is preferred.

We collected statistics about the cohesion (mean, median), the number of views, outliers etc. for values of n equal to 20, 25, 30, 35, 40, and 50. The results are presented in Table 5. ID indicates the dataset-ID (as per Table 1), TRS stands for the number of Top-Ranking Sentences, and V and O stand for the number of views and outliers respectively.

Figures 2 to 9 plot the variation in the mean and median cohesion in relation to the number of TRS (n). The value of n is plotted on the horizontal axis and the value of cohesion on the vertical axis. We can deduce from the graphs that the mean and median cohesion peak for 20 ≤ n ≤ 35. The exact breakup of the value of n yielding the best cohesion for each dataset is provided in Table 3.

As evident from our results, choosing more top-ranking sentences need not lead to views with better cohesion. To extract views with the best cohesion, one can start with a lower bound (e.g. 20) of top-ranking sentences and incrementally add x sentences until one reaches an upper bound (e.g. 35). Incremental clustering [3] can be used to obtain the views, and the cohesion values can be compared to present the set of views which yields the best cohesion.
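One way to realize this sweep, reusing the cluster_sentences and cohesion helpers sketched earlier; selecting by median cohesion and re-clustering from scratch at each n are our simplifications of the incremental scheme suggested above:

    import statistics

    def best_views(ranked_sentences, sim_fn, lower=20, upper=35, step=5):
        """Try each n in [lower, upper] and keep the view set whose
        median cohesion over multi-sentence views is highest."""
        best, best_score = None, float("-inf")
        for n in range(lower, upper + 1, step):
            top = ranked_sentences[:n]
            matrix = [[sim_fn(a, b) for b in top] for a in top]
            views = cluster_sentences(matrix)
            scores = [cohesion(v, matrix) for v in views if len(v) > 1]
            score = statistics.median(scores) if scores else 0.0
            if score > best_score:
                best, best_score = views, score
        return best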
Below we present three views mined by our framework. The value of n for each view is the one which yields the best cohesion for that dataset (as presented in Table 3).

Example 1 | fifa vuvuzela (7) | n: 35 | cohesion: 40.71 (MRV)
The true origin of the vuvuzela is disputed, but Mkhondo and others say the tradition dates back centuries - "to our forefathers" - and involves the kudu. {5} The plastic trumpets, which can produce noise levels in excess of 140 decibels, have become the defining symbol of the 2010 World Cup. {12} For this reason, there is no doubt that the vuvuzela will become one of the legacies that Africa will hand over to the world after the world cup tournament, since the Europeans, Americans and Asians could not resist the temptation of using it and are seen holding it to watch their matches. {3} Have you ever found yourself in bed in a dark room with just a single mosquito for company? The buzzing sound of the vibrations made by the mosquito's wings. {10} On the other hand, its ban will affect the mood of the host nation and, of course, other African countries at the world cup, because of the deep rooted emotions attached to it by fans. {3} This has sparked another controversy in the course of the tournament and has become the single item for discussion in the media since the LOC made that controversial statement on Sunday evening. {3}

Example 2 | swine flu india (2) | n: 25 | cohesion: 4.52 (Rank 4)
Patnaik, who created the image with the help of his students, on the golden beach has depicted the pig wearing a mask with the message 'Beware of swine flu'. The sculpture was put on display late Thursday on the beach in Puri, 56 km from the state capital Bhubaneswar. {18} Of the six cases reported in Pune, three are students who contracted the virus in the school. {91}

Example 3 | the lost symbol (4) | n: 20 | cohesion: 40.02 (MRV)
I read the book as fast as I could. Of course as a Dan Brown classic, it was very interesting, exciting and made me wanting to read as fast as I could. {13} Every symbol, every ritual, every society, all of it, even the corridors and tunnels below Washington, DC, it's all real. {3} I feel more connected to the message of this book (the reach and the power of the human mind) than I did to possibility that Jesus had a child. {12} Malakh is after secret knowledge guarded by the Masons and he'll stop at nothing to get it. To that end he's kidnapped Peter Solomon, the head of the Masonic order in Washington, DC. {1} Malakh is about to reach the highest level in the Masonic order, but even so, he knows he will not be admitted to the most secret secrets. Secrets that he's sworn to protect. He is not what he seems to his fellow Masons. He's lied to them. He has his own agenda. {2} [sic]
The first and third examples were ranked first (MRV) by our framework and the second one was ranked fourth. If we examine the first example, a user who does not know the term 'vuvuzela' can immediately glean that it is a plastic trumpet which caused quite a stir in the fifa world cup 2010. There are also some sentences which insinuate toward a likely ban and surrounding controversy. In an ideal scenario, we would like to group the sentences about the ban and the controversy in another view; but as it stands now, our view describes the instrument and the impact of the vuvuzela on the world cup, and serves as a good introduction to a novice or as a concise issue capsule to a user who is already familiar with the topic.

Dataset : Number of TRS
financial meltdown : 25
swine flu india : 25
israel attacks gaza : 30
the lost symbol : 20
hotel taj krishna : 20
hotel marriott : 25
fifa vuvuzela : 35
gulf oil spill : 30

Table 3: Breakup of n

Similarly, the second example, which was ranked fourth by our framework, talks about the repercussions of the disease swine flu on pune and puri (cities in India). The third example, ranked first, contains some positive opinions about the book 'The Lost Symbol' and also a sneak peek into the intentions of the character Malakh. Additional example views are provided in the appendix.

The average number of sentences across all the views was found to be 3.9 and the average number of views across all the datasets was found to be 4.88. Table 4 presents the breakup for each dataset; Mean (S) indicates the average number of sentences across all the views, and Mean (N) indicates the average number of views. The implementation of the framework as described in Algorithm 1 took an upper bound of 4.2 seconds to run, with computeTFIDF and calculateImportance being the time-consuming steps at 2.6 seconds.

Dataset              Mean (S)  Mean (N)
financial meltdown   3.91      3.17
swine flu india      4.26      5.67
israel attacks gaza  3.74      4.17
the lost symbol      4.16      5.33
hotel taj krishna    3.67      4.17
hotel marriott       3.82      5.83
fifa vuvuzela        3.56      5.33
gulf oil spill       4.21      5.00

Table 4: Mean values
The main difference between summarization and our framework is that we provide multiple diverse views, as opposed to summarization, which lacks such an organization. We also rank these views, thereby allowing a user to look at just the Most Relevant View (MRV) or the top x views at his convenience. As we provide the IDs of the source articles in each view, a user can also browse through them to learn more about that view.

4. CONCLUSION

Users who want to browse the content of a topic on the world wide web (www) have to wade through the diverse articles available on it. Though summarization is successful in condensing a huge volume of information, it groups several issues pertaining to a topic together and lacks an organized representation of the underlying issues. In this paper, we propose a framework to mine the multiple views addressed by a collection of articles. These views are easily navigable and provide the user a detailed snapshot of their topic of interest. Our framework extends the concept of clustering to the sentence or phrase level (as opposed to document clustering) and groups semantically related sentences together to organize content in a way that is different from text summarization.

In future, we want to determine the polarity of a view (positive/negative/neutral) by examining the adjectives in it. We also want to incorporate user feedback by means of clicks and time spent on a page (implicit) and ratings and numerical scores (explicit) to evaluate the performance of our framework and, if possible, re-rank the views.

5. REFERENCES

[1] A. L. Powell and J. C. French. Using multiple views of a document collection in information exploration. In CHI'98: Information Exploration Workshop, 1998.
[2] R. K. Ando, B. K. Boguraev, R. J. Byrd, and M. S. Neff. Multi-document summarization by visualizing topical content. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization - Volume 4, NAACL-ANLP-AutoSum '00, pages 79-98, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
[3] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, STOC '97, pages 626-635, New York, NY, USA, 1997. ACM.
[4] Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.
[5] M. A. Hearst. TextTiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist., 23:33-64, March 1997.
[6] R. Katragadda, P. Pingali, and V. Varma. Sentence position revisited: a robust light-weight update summarization 'baseline' algorithm. In CLIAWS3 '09: Proceedings of the Third International Workshop on Cross Lingual Information Access, pages 46-52, Morristown, NJ, USA, 2009. Association for Computational Linguistics.
[7] R. Mihalcea and C. Corley. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI '06, pages 775-780, 2006.
[8] A. L. Powell and J. C. French. The potential to improve retrieval effectiveness with multiple viewpoints. Technical report, VA, USA, 1998.
[9] M. Pucher. Performance evaluation of wordnet-based semantic relatedness measures for word prediction in conversational speech. In IWCS-6: Sixth International Workshop on Computational Semantics, Tilburg, Netherlands, 2005.
[10] A. Tombros, J. M. Jose, and I. Ruthven. Clustering top-ranking sentences for information access. In Proceedings of the 7th ECDL Conference, pages 523-528, 2003.
APPENDIX

Example 4 | gulf oil spill (8) | n: 35 | cohesion: 16.64 (Rank 2)
BP and the Coast Guard are also using chemicals to disperse the oil, which for the most part is spread in a thin sheen. But the area of the sheen has expanded to more than 150 miles long and about 30 miles wide. {1} The Coast Guard confirmed that the leading edge of the oil slick in the Gulf of Mexico is three miles from Pass-A-Loutre Wildlife Management Area, the Reuters news agency reported. The area is at the mouth of the Mississippi River. {1} "They're going to be focusing on the root cause, how the oil and gas were able to enter the [well] that should've been secured," he said. "That will be the primary focus, how the influx got in to the [well]." {1}

Example 5 | hotel marriott (6) | n: 30 | cohesion: 15.23 (Rank 3)
Well located hotel offering good view of the lake. The rooms are clean and comfortable and have all amenities and facilities of a 5 star hotel. The hotel is not overtly luxurious but meets all expectations of a business traveller. The Indian restaurant is uper and a must-try. {6} The food is excellent and like I said, if it were not for the smell and so-so servie, I would stay here. {14} The rooms are great. Well lit, loaded with amenities and the trademark big glass windows to look out.. The bathroom is trendy and looks fabulous with rain shower and a bathtub. {12} [sic]

Example 6 | swine flu india (2) | n: 25 | cohesion: 15.98 (Rank 3)
Three new cases of swine flu were confirmed in the city on Sunday, taking the total number of those infected to 12 in the State. {5} "Currently, it isn't the flu season in India, but if the cases keep coming in even after the rains, it will clash with our flu season (post-monsoon and winter period) which could be a problem", he said. {55} In Delhi, out of the four cases, three people, including two children aged 12, contracted the virus from a person who had the flu. {12}

Example 7 | financial meltdown (1) | n: 35 | cohesion: 0 (Outlier View)
It has to be said: The model of the credit rating agencies has collapsed. Whether because of their unprofessionalism or inherent conflicts of interest, the fact that the agencies receive their pay from the companies they cover has bankrupted the system. {11}

Example 8 | israel attacks gaza (3) | n: 40 | cohesion: 0 (Outlier View)
"I heard the explosions when I was standing in the hall for protection. Suddenly, in a few seconds, all of the police and firemen were in the building," said resident Rachel Mor, 25. {21}

ID  TRS  Mean (C)  Median (C)  V  O
1   20   17.58     17.6        3  10
1   25   32.52     36.87       3  13
1   30   11.23     11.23       2  15
1   35   10.78     10.78       2  18
1   40   12.5      16.74       3  20
1   50   4.43      4.69        6  25
2   20   18.86     15.63       4  10
2   25   15.6      15.6        4  13
2   30   10.98     4.97        5  15
2   35   14.16     5.12        7  18
2   40   10.03     5.12        7  20
2   50   11.42     4.79        7  25
3   20   13.32     4.52        3  12
3   25   17.34     14.82       4  15
3   30   19.38     21.56       4  18
3   35   18.53     15.07       5  21
3   40   7.11      5.1         4  24
3   50   11.44     4.75        5  30
4   20   23.32     24.04       4  10
4   25   16.55     16.3        6  13
4   30   20.37     16.86       5  15
4   35   7.18      5.07        5  18
4   40   13.13     11.25       6  20
4   50   8.46      4.54        6  25
5   20   10.61     10.61       2  10
5   25   7.34      5.58        3  13
5   30   5.48      5.58        3  15
5   35   17.25     5.58        7  18
5   40   12.02     6.59        4  20
5   50   10.64     5.11        6  25
6   20   10.51     5.18        3  10
6   25   19.83     15.23       5  13
6   30   14.94     10.21       6  15
6   35   14.55     10.37       6  18
6   40   14        10.33       8  20
6   50   7.87      4.47        7  25
7   20   11.52     5.09        4  10
7   25   13.55     4.72        4  14
7   30   11.9      4.73        5  16
7   35   14.74     4.73        5  20
7   40   8.35      4.59        6  26
7   50   7         4.5         8  32
8   20   10.52     10.72       4  10
8   25   13.95     10.55       4  14
8   30   14.54     16.3        5  16
8   35   12.94     10.61       6  19
8   40   10.92     4.58        5  27
8   50   11.81     4.72        6  34

Table 5: Results
[Figures 2-9 plot the mean and median cohesion (vertical axis) against the number of TRS, n (horizontal axis), for each dataset.]

Figure 2: financial meltdown. Figure 3: swine flu india. Figure 4: israel attacks gaza. Figure 5: the lost symbol. Figure 6: hotel taj krishna. Figure 7: hotel marriott. Figure 8: fifa vuvuzela. Figure 9: gulf oil spill.